(1) Objectives: Most complex networks, such as the World Wide Web, grow over time, and
such growth is usually characterized by highly distributed phenomena. However, the
complexity and distributed nature of those networks do not imply that their growth is
chaotic or unpredictable. Just as natural scientists discover laws and create models for
their fields, so one can, in principle, find empirical regularities and develop explanatory
accounts of changes in the network. In the case of the World Wide Web, such predictive
knowledge would be valuable for anticipating computing needs, social trends, and
market opportunities.

We can now obtain digital traces of human social interaction, with time-stamp
information, in a wide variety of online settings, such as blog (weblog) communications
and email exchanges. Such social interaction can be naturally represented as a large-scale
social network that grows over time, where nodes (vertices) correspond to people or other
social entities, and links (edges) correspond to social interactions between them. Clearly,
these growing social networks reflect complex social structures and distributed social
trends. Thus, it is worthwhile to look for empirical regularities and to develop
explanatory accounts of changes in these social networks. Such efforts would be
valuable for understanding social structures and trends, and for inspiring the discovery of
new knowledge and insights into the underlying social interaction. We carry out extensive
research on computational methods for the discovery of knowledge from growing social
networks.

(2) Status of effort: We have found that probabilistic models of information diffusion
processes over social networks play an essential role in the discovery of knowledge. Thus,
we carried out research on mathematical models that enable us to explain, control, and
visualize a wider variety of information diffusion processes. In particular, mathematical
studies of this kind that use large-scale networks, such as a blog communication network,
are expected to bridge the gap between empirical social network analysis and
fundamental mathematics. In the first year, we derived a very efficient method for
minimizing the propagation of undesirable things by blocking a limited number of links in
a network. In addition, we developed an effective visualization method for understanding
a complex network, in particular its dynamical aspects such as the information diffusion
process. Furthermore, we proposed a new scheme for empirical study to explore the
behavioral characteristics of representative information diffusion models. In the second
year, we developed an effective method for ranking influential nodes in complex social
networks by estimating diffusion probabilities from observed information diffusion data
using the popular independent cascade (IC) model. In addition, we derived a very efficient
method for discovering the influential nodes in a social network under the
susceptible/infected/susceptible (SIS) model. Furthermore, we proposed a new method for
learning a continuous-time information diffusion model for social behavioral data analysis.

(3) Abstract: First, we addressed the problem of minimizing the propagation of undesirable
things, such as computer viruses or malicious rumors, by blocking a limited number of
links in a network, a converse problem to the influence maximization problem of finding
the most influential nodes in a social network for information diffusion. This
minimization problem is another approach to the problem of preventing the spread of
information. We derived a method for efficiently finding a good approximate solution to
this problem based on a natural greedy strategy. Using large real networks, we
demonstrated experimentally that the proposed method significantly outperforms
conventional link-removal methods. We also showed that, unlike the strategy of removing
nodes, blocking links between nodes with high out-degrees is not necessarily effective.

Second, we addressed the problem of effective visualization for understanding a complex
network, in particular its dynamical aspects such as the information diffusion process.
Existing node embedding methods are all based solely on the network topology and sometimes
produce counter-intuitive visualizations. We developed a new node embedding method
based on conditional probability that explicitly addresses the diffusion process, using
either the IC (Independent Cascade) or the LT (Linear Threshold) model, as a cross-entropy
minimization problem, together with two label assignment strategies that can be
adopted simultaneously. Numerical experiments were performed on two large real
networks, one represented by a directed graph and the other by an undirected graph. The
results clearly demonstrated the advantage of the developed method over the conventional
spring model and topology-based cross-entropy methods, especially for the case of
directed networks.

Third, we attempted to answer the question "What does an information diffusion model tell
about social network structure?" To this end, we proposed a new scheme for empirical
study to explore the behavioral characteristics of representative information diffusion
models, such as the IC model and the LT model, on large networks with different
community structures. To change the community structure, we first construct a GR
(Generalized Random) network from an originally observed network by randomly
rewiring links of the original network without changing the degree of each node. Then we
plot the expected number of influenced nodes based on an information diffusion model
with respect to the degree of each information source node. Using large real networks, we
found empirically that the proposed scheme uncovered a number of new insights. Most
importantly, we showed that community structure affects the information diffusion
processes of the IC model more strongly than those of the LT model. Moreover, by
visualizing these networks, we gave some evidence that our claims are reasonable.

Fourth, we addressed the problem of ranking influential nodes in complex social networks
by estimating diffusion probabilities from observed information diffusion data using the
IC model. For this purpose, we formulated the likelihood for information diffusion data,
which is a set of time-sequence data of active nodes, and proposed an iterative method to
search for the probabilities that maximize this likelihood. We applied this to two real-world
social networks in the simplest setting, where the probability is uniform for all the links,
and showed that the probability estimation is highly accurate. We further
showed that the proposed method can predict the high-ranked influential nodes much more
accurately than four well-studied conventional heuristic methods.

Fifth, we addressed the problem of efficiently discovering the influential nodes in a social
network under the SIS model, a diffusion model where nodes are allowed to be activated
multiple times. The computational complexity drastically increases because of this
multiple activation property. We solved this problem by constructing a layered graph from
the original social network, with each layer added on top as time proceeds, and
applying bond percolation with pruning and burnout strategies. Using two large-scale
real-world networks (a blog network and a Wikipedia network), we experimentally
demonstrated that the proposed method gives much better solutions than the conventional
methods that are based solely on the notion of centrality for social network analysis.
We further showed by a theoretical analysis that the computational complexity of the
proposed method is much smaller than that of the conventional naive probabilistic
simulation method, and confirmed this by experimentation. The properties of the
influential nodes discovered were substantially different from those identified by the
centrality-based heuristic methods.

Finally, we addressed the problem of estimating the parameters for a continuous time
delay independent cascade (CTIC) model, a more realistic model for information diffusion
in a complex social network, from the observed information diffusion data. For this purpose,
we formulated the rigorous likelihood of the observed data and proposed an iterative
method to obtain the parameters (time-delay and diffusion) by maximizing this likelihood.
We applied this method first to the problem of ranking influential nodes using the network
structure taken from two real-world web datasets, and showed that the proposed method
can predict the high-ranked influential nodes much more accurately than four well-studied
conventional heuristic methods. We applied it second to the problem of evaluating how
different topics propagate in different ways using real-world blog data, and showed that
there are indeed differences in the propagation speed among different topics.
Community Analysis of Influential Nodes for Information Diffusion on a Social Network


Masahiro  Kimura,  Kazumasa  Yamakawa,  Kazumi  Saito,  and  Hiroshi  Motoda 


Abstract: We consider the problem of finding influential
nodes for information diffusion on a social network under
the independent cascade model. It is known that the greedy
algorithm can give a good approximate solution for the problem.
Aiming to obtain efficient methods for finding better
approximate solutions, we explore what structural feature of
the underlying network is relevant to the greedy solution, that is,
the approximate solution obtained by the greedy algorithm. We focus on
the SR-community structure, and analyze the greedy solution
in terms of the SR-community structure. Using real large
social networks, we experimentally demonstrate that the
SR-community structure can be more strongly correlated with the
greedy solution than the community structure introduced by
Newman and Leicht.

I.  Introduction 

Recently, considerable attention has been devoted to social
network analysis [9], [14], [1], [2], [8], [13], [7], since the
rise of the Internet and the World Wide Web has enabled us
to collect real large social networks. Here, a social network
is the network of relationships and interactions among social
entities such as individuals, organizations and groups.
Examples include blog networks, collaboration networks, and
email networks.

A social network plays an important role in the spread of
information, since a piece of information can propagate from
one node to another node through a link on the network
in the form of “word-of-mouth” communication [3]. Thus,
it is an important research issue to find influential nodes
for information diffusion on a social network in terms of
sociology and “viral marketing”. In fact, researchers [5], [6]
have recently studied a combinatorial optimization problem
called the influence maximization problem on a network
under the independent cascade (IC) model, which is a widely used
fundamental probabilistic model of information diffusion.
Here, the influence maximization problem of size $k$ is the
problem of extracting a set of $k$ nodes to target for initial
activation such that it yields the largest expected spread of
information, where $k$ is a given positive integer. Kempe et
al. [5] experimentally showed on large collaboration networks
that the greedy algorithm can give a good approximate
solution for the influence maximization problem under the
IC model. We refer to the approximate solution obtained
by the greedy algorithm as the greedy solution. Using an
analysis framework based on submodular functions, Kempe
et al. [5] mathematically proved a performance guarantee of
the greedy solution. Moreover, Kimura et al. [6] presented
a method of efficiently estimating the greedy solution on
the basis of bond percolation and graph theory. However,
it is desirable to construct efficient methods of obtaining
better approximate solutions for the influence maximization
problem on a network under the IC model. Towards this aim,
it is important to understand what structural feature of the
underlying network is correlated with the greedy solution.

As a structural feature of a given network, we focus on the
SR-community structure $U = (U_m;\ m = 1, 2, 3, \cdots)$ [15],
which is a sequence of densely connected sets of nodes in
the network. Here, the $m$th SR-community $U_m$ is defined as
the set of nodes in the network that maximizes the average
number of links within it after removing all the links within
$U_j$ $(j = 0, \cdots, m-1)$, where $U_0$ is the empty set $\emptyset$. In
this paper, we analyze the greedy solution for the influence
maximization problem under the IC model in terms of the
SR-community structure $U$. For the influence maximization
problem of size $k$, we extract the minimal sequence of
SR-communities in $U$, $\mathcal{U}_k = (U_m;\ m = 1, \cdots, M_k)$, such that
it covers the greedy solution, and investigate the similarity
between the set of nodes influenced by each node $v_i$
in the greedy solution and the SR-community in $\mathcal{U}_k$ that
corresponds to the node $v_i$. On this basis, we
quantify the strength of the correlation between the greedy
solution and the SR-community structure. Using real large
social networks, we experimentally demonstrate that, unlike
the community structure introduced by Newman and Leicht
[12], the SR-community structure can be strongly correlated
with the greedy solution.

II.  Influential  Nodes  for  Information  Diffusion 

Throughout this paper, we consider a social network
represented by an undirected graph, and discuss the spread of
a piece of information through the network under the IC model
by regarding the undirected links as bidirectional ones. We
call nodes active if they have accepted the information.

A.  Independent  Cascade  Model 

We define the IC model. In this model, the diffusion
process unfolds in discrete time-steps $t \ge 0$, and it is
assumed that nodes can switch from being inactive to being
active, but cannot switch from being active to being inactive.
Given an initial set $X$ of active nodes, we assume that the
nodes in $X$ have first become active at step 0, and all the
other nodes are inactive at step 0. We specify a real value
$\beta_{u,v} \in [0, 1]$ for each directed link $(u, v)$ in advance. Here,
$\beta_{u,v}$ is referred to as the propagation probability through
link $(u, v)$.

When an initial set $X$ of active nodes is given, the
diffusion process proceeds in the following way. When node
$u$ first becomes active at step $t$, it is given a single chance
to activate each currently inactive neighbor $v$, and succeeds
with probability $\beta_{u,v}$. If $u$ succeeds, then $v$ will become
active at step $t + 1$. If multiple parents of $v$ first become
active at step $t$, then their activation attempts are sequenced
in an arbitrary order, but performed at step $t$. Whether or not
$u$ succeeds, it cannot make any further attempts to activate
$v$ in subsequent rounds. The process terminates if no more
activations are possible.

For an initial active set $X$, let $\sigma(X)$ denote the expected
number of active nodes at the end of the random process in
the IC model. We call $\sigma(X)$ the influence degree of initial
active set $X$.
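
The diffusion process and the influence degree defined above can be illustrated with a minimal sketch. It assumes the network is given as a dictionary of neighbor lists and that the propagation probabilities are supplied by a function; the names `simulate_ic`, `influence_degree`, `neighbors`, `beta`, and `runs` are hypothetical and do not come from the paper. The influence degree is estimated by plain Monte Carlo simulation, not by the bond percolation method of [6].

```python
import random

def simulate_ic(neighbors, beta, seeds, rng):
    """Run one IC diffusion from `seeds`; return the set of nodes active at the end."""
    active = set(seeds)          # the nodes in X are active at step 0
    frontier = list(active)      # nodes that first became active at the current step
    while frontier:              # the process stops when no activation is possible
        newly_active = []
        for u in frontier:
            for v in neighbors[u]:
                # u gets a single chance to activate each inactive neighbor v.
                # Marking v active right away gives the same final set as
                # activating it at step t + 1, because v only makes its own
                # attempts in the next round.
                if v not in active and rng.random() < beta(u, v):
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return active

def influence_degree(neighbors, beta, seeds, runs=1000, seed=0):
    """Monte Carlo estimate of sigma(X): mean number of active nodes over `runs` trials."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(neighbors, beta, seeds, rng)) for _ in range(runs))
    return total / runs

# Undirected path 0 - 1 - 2, with each link treated as bidirectional.
path = {0: [1], 1: [0, 2], 2: [1]}
print(influence_degree(path, lambda u, v: 0.5, seeds=[0]))
```

The number of trials needed for a stable estimate depends on the network and on the propagation probabilities, so `runs=1000` is only a placeholder.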


B. Influence Maximization Problem

We consider the influence maximization problem of size
$k$ under the IC model. Let $S$ be the set of all the nodes
in the network. The problem is defined as follows: Given
a positive integer $k$, find a set $X_k^*$ of $k$ nodes to target for
initial activation such that $\sigma(X_k^*) \ge \sigma(Y)$ for any set $Y$ of
$k$ nodes. To approximately solve this optimization problem,
we consider the following greedy algorithm:

1) Set $X \leftarrow \emptyset$.

2) for $i = 1$ to $k$ do

3) Choose a node $v_i \in S \setminus X$ maximizing $\sigma(X \cup \{v\})$,
$(v \in S \setminus X)$.

4) Set $X \leftarrow X \cup \{v_i\}$.

5) end for

Let $S_k$ denote the set of $k$ nodes obtained by this algorithm.
We call $S_k$ the greedy solution of the influence maximization
problem of size $k$.

Using large collaboration networks, Kempe et al. [5]
experimentally demonstrated that the greedy solution $S_k$
outperforms the approximate solutions obtained by the
high-degree and centrality heuristics that are commonly used in
the sociology literature. It is also known that

$$\sigma(S_k) \ge \left(1 - \frac{1}{e}\right) \sigma(X_k^*),$$

that is, a performance guarantee of the greedy solution $S_k$
is obtained [5]. For any initial active set $X$, a good estimate
of $\sigma(X)$ was conventionally obtained by simulating the
random process of the IC model many times. Thus, any
straightforward method to estimate the greedy solution $S_k$
needed a large amount of computation on a large network.
However, Kimura et al. [6] gave an efficient method for
estimating $S_k$ on the basis of bond percolation and graph
theory. In this paper, using their method, we estimate the
greedy solution $S_k$.

III. SR-community Structure

In this section, we define the SR-community structure, and
describe a method for efficiently estimating it according to
the work of Saito et al. [15].

A. Definition

Let $A$ be the adjacency matrix of a network, and let

$$S = \{1, \cdots, N\}$$

be the set of all the nodes in the network. Namely, each
$(i, j)$-element of the adjacency matrix, denoted by $A(i, j)$,
is set to 1 if there exists a link (edge) between nodes $i$ and
$j$; otherwise 0. In this paper, we focus on undirected graphs
without self-connections, i.e., $A(i, j) = A(j, i)$, $A(i, i) = 0$,
$(i, j = 1, \cdots, N)$. For any subset of nodes, $T \subset S$, we can
define the average number of links within $T$ as follows:

$$G(T) = \frac{1}{2|T|} \sum_{i \in T} \sum_{j \in T} A(i, j), \tag{1}$$


where $|T|$ stands for the number of elements in $T$. First, let
$U_1$ denote the subset of $S$ that maximizes the average number
of links within it (see (1)). Next, for the network constructed
by removing all the links within $U_1$ from the original
network, let $U_2$ denote the subset of $S$ that maximizes the
average number of links within it (see (1)). Next, for the
network constructed by removing all the links within $U_1$
and $U_2$ from the original network, let $U_3$ denote the subset of
$S$ that maximizes the average number of links within it (see
(1)). By repeatedly performing these procedures, we define
the sequence of subsets of $S$,

$$U = (U_m;\ m = 1, 2, 3, \cdots).$$

Here, $U$ is called the SR-community structure of the original
network, and each $U_m$ is referred to as the $m$th
SR-community. Note that the SR-community structure $U$
represents a structural feature of the network.

In the case of a large network, any straightforward method
for detecting the SR-community structure is likely to suffer
from combinatorial explosion. To cope with such a large
network, we employ the method presented by Saito et
al. [15].


B.  Relaxation  problem 

For a subset $T$ of $S$, we define an $N$-dimensional indicator
vector $q$ by setting $q(i) = 1$ if $i \in T$; otherwise $q(i) = 0$.
Then we can rewrite (1) as follows:

$$G(q) = \frac{1}{2} \frac{q^T A q}{q^T q}, \tag{2}$$

where $q^T$ stands for the transposed vector of $q$. Now we
consider a relaxation problem by letting $q$ take continuous
values. Then, according to the Rayleigh-Ritz theorem [4],
the solution that maximizes $G(q)$ is given by the principal
eigenvector $q^*$ of the adjacency matrix $A$.

In order to obtain the eigenvector $q^*$, we employ the
following procedure based on the power iteration [4].

E1. Initialize $q^{(0)} = (1, \cdots, 1)^T$, and set $\tau \leftarrow 1$;

E2. Calculate $\tilde{q} = A q^{(\tau - 1)}$ and $q^{(\tau)} = \tilde{q} / \max_i \tilde{q}(i)$;

E3. Terminate if $\max_i |q^{(\tau)}(i) - q^{(\tau - 1)}(i)| < \epsilon$;

E4. Set $\tau \leftarrow \tau + 1$, and return to E2.

Here a small positive parameter $\epsilon$ controls the termination
condition, and we can obtain the final solution as $q^* = q^{(\tau)}$
after its termination. Since all the elements of $A$ and $q^{(0)}$
have non-negative values, we can guarantee that all the
elements of $q^{(\tau)}$ also have non-negative values after any number
of iterations. Moreover, due to the scaling operation in E2,
we can guarantee that $0 \le q^{(\tau)}(i) \le 1$ for any $\tau$ and $i$. Thus
we consider that the above formulation gives a desirable
relaxed solution to the original problem.
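
Steps E1 to E4 can be sketched as follows. This is an illustrative NumPy version, not the authors' implementation; the function name `principal_eigenvector` and the `max_iter` cap are assumptions. The cap is added because the power iteration can oscillate instead of converging on some graphs (for example bipartite ones), a case the procedure above does not discuss.

```python
import numpy as np

def principal_eigenvector(A, eps=1e-9, max_iter=10000):
    """Power iteration following steps E1-E4, scaled so that the largest element is 1."""
    A = np.asarray(A, dtype=float)
    q = np.ones(A.shape[0])                    # E1: q^(0) = (1, ..., 1)^T
    for _ in range(max_iter):
        q_new = A @ q                          # E2: multiply by the adjacency matrix
        top = q_new.max()
        if top == 0.0:                         # no links at all: nothing to rank
            return q_new
        q_new = q_new / top                    # E2: scale by the maximum element
        if np.max(np.abs(q_new - q)) < eps:    # E3: termination test
            return q_new
        q = q_new                              # E4: next iteration
    return q                                   # assumption: give up after max_iter rounds

# Triangle 0-1-2 with a pendant node 3 attached to node 0.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]])
print(principal_eigenvector(A))
```

On this small graph the best-connected node 0 receives the largest value, and the pendant node 3 the smallest.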


C.  Quantization  problem 

By ranking nodes according to the values of eigenvector elements,
we can obtain a list of nodes, $R = [r(1), \cdots, r(N)]$,
where $r(i)$ stands for a mapping from ranks to nodes. Note
that $q^*(r(i)) \ge q^*(r(i + 1))$ for any $i$. By considering a set
of the top $h$ nodes,

$$T(h) = \{r(i) : i = 1, \cdots, h\}, \tag{3}$$

we can calculate the average number of links within $T(h)$
as follows:

$$G(h) = \sum_{i=1}^{h-1} \sum_{j=i+1}^{h} \frac{A(r(i), r(j))}{h}. \tag{4}$$

In our method, instead of directly solving (1), we compute
a node set $T(h^*)$, where $h^*$ maximizes (4).

In order to efficiently calculate $h^*$, we utilize the following
update formula:

$$G(h + 1) = G(h) + \frac{\Delta(h + 1) - G(h)}{h + 1}, \tag{5}$$

where $\Delta(h + 1)$ stands for the increment by adding node
$r(h + 1)$, calculated by

$$\Delta(h + 1) = \sum_{j=1}^{h} A(r(j), r(h + 1)). \tag{6}$$

Note that $G(1) = 0$. The above procedure can be summarized
as follows.

F1. Compute $r(i)$ by sorting elements of $q^*$;

F2. Calculate $G(2), \cdots, G(N)$ by using (5) and (6);

F3. Output $T(h^*)$ such that $h^* = \arg\max_h G(h)$.
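
As a small illustration of steps F1 to F3, the sketch below applies the update formula (5) with the increment (6) to a ranked node list. The function name `densest_prefix` is hypothetical, ties in $q^*$ are broken by node index, and when several values of $h$ attain the maximum the smallest one is returned; the source does not specify either tie rule.

```python
def densest_prefix(A, q):
    """Return (T(h*), G(h*)) for adjacency matrix A and eigenvector values q."""
    n = len(q)
    order = sorted(range(n), key=lambda i: -q[i])   # F1: r(1), ..., r(N)
    g = 0.0                                         # G(1) = 0
    best_h, best_g = 1, 0.0
    for h in range(1, n):                           # F2: compute G(h + 1) from G(h)
        new = order[h]                              # node r(h + 1)
        delta = sum(A[order[j]][new] for j in range(h))   # increment (6)
        g = g + (delta - g) / (h + 1)                     # update formula (5)
        if g > best_g:                              # keep the first maximizer
            best_h, best_g = h + 1, g
    return order[:best_h], best_g                   # F3: T(h*) and its score

# Triangle 0-1-2 with a pendant node 3 attached to node 0.
A = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]
q = [1.0, 0.8, 0.8, 0.4]      # illustrative ranking values, not a computed eigenvector
print(densest_prefix(A, q))
```

Each step of the loop costs one pass over the already selected nodes, so the whole scan avoids recomputing the double sum in (4) for every $h$.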


D.  Detection  algorithm 

By repeatedly performing the above procedures $M$ times,
we can detect $M$ densely connected portions for a given
network as follows.

G1. Repeat the following steps for $m = 1$ to $M$;

G2. Calculate $q^*$, using E1 to E4;

G3. Calculate $T^{(m)} = T(h^*)$, using F1 to F3;

G4. Set $A(i, j) = 0$ if $i, j \in T^{(m)}$.

Here, the number $M$ of communities is determined by a
user. We estimate the $m$th SR-community $U_m$ as $T^{(m)}$ for any
integer $m$ with $1 \le m \le M$.
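
The whole detection loop G1 to G4 can be put together in a compact sketch. This is an illustrative vectorized version under stated assumptions, not the implementation of Saito et al. [15]: the function name `sr_communities`, the `max_iter` cap on the power iteration, and the tie-breaking by node index are all choices made here. The scores $G(1), \cdots, G(N)$ are obtained with a cumulative sum, which gives the same values as (4).

```python
import numpy as np

def sr_communities(A, M, eps=1e-9, max_iter=10000):
    """Estimate the first M SR-communities of an undirected graph (steps G1-G4)."""
    A = np.array(A, dtype=float)            # working copy; links are removed as we go
    n = len(A)
    communities = []
    for _ in range(M):                      # G1
        # G2: power iteration (E1-E4), capped at max_iter rounds.
        q = np.ones(n)
        for _ in range(max_iter):
            q_new = A @ q
            if q_new.max() == 0.0:          # no links left
                break
            q_new /= q_new.max()
            converged = np.max(np.abs(q_new - q)) < eps
            q = q_new
            if converged:
                break
        # G3: rank the nodes and pick the prefix T(h*) maximizing G(h).
        order = np.argsort(-q, kind="stable")
        P = A[np.ix_(order, order)]         # adjacency matrix in rank order
        links = np.cumsum(np.triu(P, 1).sum(axis=0))   # links among the top-h nodes
        G = links / np.arange(1, n + 1)
        T = order[: int(np.argmax(G)) + 1]
        communities.append(sorted(T.tolist()))
        # G4: remove all the links within the detected community.
        A[np.ix_(T, T)] = 0.0
    return communities

# A 4-clique {0,1,2,3} and a triangle {4,5,6} joined by the link 3-4.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (3, 4), (4, 5), (4, 6), (5, 6)]
A = np.zeros((7, 7))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
print(sr_communities(A, 2))
```

Because only links inside a detected community are removed, a node may appear in more than one SR-community.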


IV.  Community  Analysis  of  Influential  Nodes 

For a given network, we consider the influence maximization
problem of size $k$ under the IC model. Let $S_k = \{v_i;\ i =
1, \cdots, k\}$ be the greedy solution, and let $U = (U_m;\ m =
1, 2, 3, \cdots)$ be the SR-community structure of the network.
We analyze the greedy solution $S_k$ in terms of the
SR-community structure $U$.

First, we extract the minimal sequence of SR-communities
in $U$ such that it covers the greedy solution $S_k$,

$$\mathcal{U}_k = (U_m;\ m = 1, \cdots, M_k),$$

that is, $M_k$ is the minimal integer $M$ satisfying

$$\bigcup_{m=1}^{M} U_m \supset S_k.$$

Note that $\mathcal{U}_k$ can be regarded as a rough approximation to
the greedy solution $S_k$. We call $M_k$ the SR-covering number
of the greedy solution $S_k$. For any $v_i \in S_k$, let $\alpha(v_i)$ denote
the minimal integer $m$ satisfying $U_m \ni v_i$. $U_{\alpha(v_i)}$ is referred
to as the SR-community that corresponds to the node $v_i$.

Next, for any $v_i \in S_k$ and a real value $p \in [0, 1]$, we
consider the influence set $H(v_i, p)$ of $v_i$ with probability $p$.
Here, $H(v_i, p)$ is the set of nodes $v$ in the network such
that when $\{v_i\}$ is the initial active set, the probability that
$v$ is active at the end of the diffusion process under the IC
model is more than $p$. Note that $v_i \in H(v_i, p) \subset H(v_i, p')$
if $0 < p' < p < 1$.

We investigate the correlation between the greedy solution
$S_k$ and the SR-community structure $U$. In terms of the
F-measure, we quantify the similarity between an influence
set $H(v_i, p)$ of each node $v_i$ in the greedy solution $S_k$ and
the SR-community $U_{\alpha(v_i)}$ that corresponds to $v_i$, that is, we
measure how close the sets $H(v_i, p)$ and $U_{\alpha(v_i)}$ are by

$$F_0(p; v_i) = 200 \, \frac{|H(v_i, p) \cap U_{\alpha(v_i)}|}{|H(v_i, p)| + |U_{\alpha(v_i)}|}. \tag{7}$$

Moreover, we quantify the strength of the correlation
between the greedy solution $S_k$ and the SR-community
structure $U$ as follows:

$$F(k) = \frac{1}{k} \sum_{i=1}^{k} F_1(v_i), \tag{8}$$

where

$$F_1(v_i) = \max_{0 < p < 1} F_0(p; v_i), \quad (i = 1, \cdots, k).$$
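
The measures (7) and (8) can be illustrated with a short sketch. It assumes that, for each node $v_i$ of the greedy solution, an estimate of the probability that each node ends up active is already available as a dictionary (for instance from repeated simulation), and that the maximum over $p$ is taken over a finite grid of thresholds. The names `f0`, `f1`, `correlation_strength`, `activation_prob`, and `thresholds` are hypothetical.

```python
def f0(influence_set, community):
    """Equation (7): F-measure, in percent, between an influence set and a community."""
    H, U = set(influence_set), set(community)
    if not H and not U:
        return 0.0
    return 200.0 * len(H & U) / (len(H) + len(U))

def f1(activation_prob, community, thresholds):
    """Maximum of F0(p; v_i) over a finite grid of thresholds p (an approximation)."""
    best = 0.0
    for p in thresholds:
        # H(v_i, p): nodes whose activation probability is more than p.
        H = {v for v, prob in activation_prob.items() if prob > p}
        best = max(best, f0(H, community))
    return best

def correlation_strength(f1_values):
    """Equation (8): average of F1(v_i) over the k nodes of the greedy solution."""
    return sum(f1_values) / len(f1_values)

# Toy data: activation probabilities when a single node v_i is the initial active set.
activation_prob = {0: 1.0, 1: 0.8, 2: 0.3, 3: 0.05}
community = {0, 1, 4}
print(f1(activation_prob, community, thresholds=[0.0, 0.1, 0.5, 0.9]))
```

A finer grid of thresholds approximates the maximum in $F_1$ more closely, at a proportional cost.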


V.  Experimental  Evaluation 

Using real large networks, we experimentally evaluate the
strength of the correlation between the greedy solution of
the influence maximization problem under the IC model and
the SR-community structure. Let $S_k = \{v_1, \cdots, v_k\}$ be the
greedy solution for the influence maximization problem of
size $k$.
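
For reference, the greedy solution $S_k$ of Section II-B can be sketched with a naive estimate of the influence degree. This is a minimal illustration assuming a uniform propagation probability and plain Monte Carlo simulation; it is not the bond percolation method of [6] used in the experiments, and the names `simulate_ic`, `sigma`, `greedy_solution`, and `runs` are hypothetical.

```python
import random

def simulate_ic(neighbors, beta, seeds, rng):
    """One IC diffusion with a uniform propagation probability beta."""
    active = set(seeds)
    frontier = list(active)
    while frontier:
        newly_active = []
        for u in frontier:
            for v in neighbors[u]:
                if v not in active and rng.random() < beta:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return active

def sigma(neighbors, beta, seeds, runs, seed):
    """Monte Carlo estimate of the influence degree of `seeds`."""
    rng = random.Random(seed)
    return sum(len(simulate_ic(neighbors, beta, seeds, rng)) for _ in range(runs)) / runs

def greedy_solution(neighbors, beta, k, runs=200, seed=0):
    """Greedy algorithm: repeatedly add the node with the largest estimated sigma."""
    chosen = []                                          # X <- empty set
    for _ in range(k):
        candidates = [v for v in neighbors if v not in chosen]
        # Every candidate is evaluated with the same random seed, so that the
        # comparison is not dominated by simulation noise (a choice made here).
        best = max(candidates,
                   key=lambda v: sigma(neighbors, beta, chosen + [v], runs, seed))
        chosen.append(best)                              # X <- X U {v_i}
    return chosen

# A star centered at node 0 and a separate link 4-5.
graph = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0], 4: [5], 5: [4]}
print(greedy_solution(graph, beta=0.5, k=2))
```

Each greedy step estimates the influence degree for every remaining node, which is the large computational cost on big networks that motivates the method of [6].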

A.  Network  Datasets 

In the evaluation experiments, it is desirable to use
large networks that exhibit many of the key features of real
social networks. Here, we report on the experimental results
for two different datasets of such real networks.

First, we employed a trackback network of blogs, since
a piece of information can propagate from one blog author
to another blog author through a trackback. Since bloggers
discuss various topics and establish mutual communications
by putting trackbacks on each other’s blogs, we regarded a
link created by a trackback as a bidirectional link for simplicity.
By tracing the trackbacks up to ten steps from the blog
of the theme “JR Fukuchiyama Line Derailment Collision”
in the site “goo”$^1$, we collected a large connected trackback
network in May, 2005. This network was an undirected graph
of 12,047 nodes and 39,960 links. This network showed
the so-called “power-law” degree distribution that most real
large networks exhibit. Here, the degree distribution is the
distribution of the number of links for every node. We refer
to this network data as the blog network dataset.

Next, we employed a network of people that was derived
from the “list of people” within Japanese Wikipedia. Specifically,
we extracted the maximal connected component of
the undirected graph obtained by linking two people in the
“list of people” if they co-occur in six or more Wikipedia
pages. We refer to this network data as the Wikipedia network
dataset. Here, the total numbers of nodes and links were
9,481 and 122,522, respectively.

$^1$ http://blog.goo.ne.jp/usertheme/

Newman and Park [11] observed that social networks
represented as undirected graphs generally have the following
two statistical properties, unlike non-social networks. First,
they show positive correlations between the degrees of adjacent
nodes. Second, they have much higher values of the
clustering coefficient than the corresponding configuration
models (i.e., random network models). Here, the clustering
coefficient $C$ for an undirected graph is defined by

$$C = \frac{3 \times \text{number of triangles on the graph}}{\text{number of connected triples of nodes}},$$

where a “triangle” means a set of three nodes each of which
is connected to each of the others, and a “connected triple”
means a node connected directly to an unordered pair of
others. Note that in terms of sociology, $C$ measures the
probability that two of your friends will also be friends of
one another. Given a degree distribution, the corresponding
configuration model of a random network is defined as the
ensemble of all possible graphs that possess the degree
distribution, with each having equal weight. The value of $C$ for
the configuration model can be exactly calculated [10]. For
the Wikipedia network, the value of $C$ of the corresponding
configuration model was 0.046, while the actual measured
value of $C$ was 0.39. Moreover, the degrees of adjacent nodes
were positively correlated for the Wikipedia network dataset.
Therefore, we consider that the Wikipedia network dataset
can be used as an example of a social network.
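
The clustering coefficient defined above can be computed directly from neighbor sets, as in the following sketch. The function name `clustering_coefficient` and the dictionary-of-sets input format are assumptions made for illustration. Each triangle is found once from each of its three corners, which supplies the factor 3 in the numerator.

```python
from itertools import combinations

def clustering_coefficient(neighbors):
    """C = 3 * (number of triangles) / (number of connected triples)."""
    closed = 0     # connected triples whose two outer nodes are also linked
    triples = 0    # all connected triples, counted at their center node
    for v, nbrs in neighbors.items():
        d = len(nbrs)
        triples += d * (d - 1) // 2          # unordered pairs of neighbors of v
        for a, b in combinations(nbrs, 2):
            if b in neighbors[a]:
                closed += 1                  # one count per corner: 3 per triangle
    return closed / triples if triples else 0.0

# Triangle 0-1-2 with a pendant node 3 attached to node 0.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(clustering_coefficient(graph))
```

For this graph there is one triangle and five connected triples, so the value is 3/5.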

B.  A  Comparison  Method 

In order to quantitatively evaluate the strength of the
correlation between the greedy solution for the influence
maximization problem under the IC model and the
SR-community structure, we employ the community structure
obtained by the method of Newman and Leicht [12] as a
baseline.

Given an integer k, the method of Newman and Leicht [12]
divides the set $S = \{1, \dots, N\}$ of nodes in the network into
k communities, that is, k disjoint subsets of S, according
to a probabilistic mixture of multinomial distributions.
More specifically, their method is as follows: First, a
probabilistic generative model for the network is given. Namely,
the probability that a network with adjacency matrix A appears
is defined by

$$P(A \mid \lambda, \theta) = \prod_{n=1}^{N} P(A(n,:) \mid \lambda, \theta),$$

where $A(n,:)$ denotes the nth row vector of A,

$$\lambda = \{\lambda_\ell;\ \ell = 1, \dots, k\}, \qquad \theta = \{\theta_{\ell j};\ \ell = 1, \dots, k,\ j = 1, \dots, N\}$$

are sets of parameters, and

$$P(A(n,:) \mid \lambda, \theta) = \sum_{\ell=1}^{k} \lambda_\ell \, P(A(n,:) \mid \ell, \theta),$$

$$P(A(n,:) \mid \ell, \theta) \propto \prod_{j=1}^{N} (\theta_{\ell j})^{A(n,j)}$$

for $\ell = 1, \dots, k$ and $n, j = 1, \dots, N$. Here, each $\lambda_\ell$ is the
mixture weight (prior probability) of the $\ell$th community, and

$$\lambda_\ell > 0 \ \ (\ell = 1, \dots, k), \qquad \sum_{\ell=1}^{k} \lambda_\ell = 1.$$

Also, each $\theta_{\ell j}$ is the probability that the jth node connects
with a node belonging to the $\ell$th community, and

$$\sum_{j=1}^{N} \theta_{\ell j} = 1$$

for $\ell = 1, \dots, k$. By performing maximum likelihood estimation
using the EM algorithm, we estimate the values of $\lambda$ and $\theta$.
Then, applying Bayes' rule, we define the community label $\ell^*(n)$
for each node n as

$$\ell^*(n) = \arg\max_{1 \le \ell \le k} P(\ell \mid A(n,:), \lambda, \theta).$$
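The estimation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [12]: the function name `newman_leicht_em`, the random initialization, the fixed iteration count, and the smoothing constant `eps` (added so that logarithms stay finite) are all choices made here for the example.

```python
import math
import random


def newman_leicht_em(A, k, iters=100, seed=0, eps=1e-9):
    """EM for a mixture of multinomials over the rows of adjacency matrix A.

    Returns (lam, theta, labels), where labels[n] is the arg-max posterior
    community of node n. `eps` is a smoothing term assumed for this sketch.
    """
    rng = random.Random(seed)
    N = len(A)
    # Random soft assignments q[n][l] serve as the starting point.
    q = []
    for _ in range(N):
        row = [rng.random() + 0.1 for _ in range(k)]
        s = sum(row)
        q.append([x / s for x in row])
    lam = [1.0 / k] * k
    theta = [[1.0 / N] * N for _ in range(k)]
    for _ in range(iters):
        # M-step: mixture weights and per-community link probabilities.
        lam = [sum(q[n][l] for n in range(N)) / N for l in range(k)]
        for l in range(k):
            w = [sum(q[n][l] * A[n][j] for n in range(N)) + eps for j in range(N)]
            tot = sum(w)
            theta[l] = [x / tot for x in w]
        # E-step: posterior P(l | A(n,:), lam, theta), computed in log space.
        for n in range(N):
            logp = [math.log(lam[l] + eps)
                    + sum(A[n][j] * math.log(theta[l][j]) for j in range(N))
                    for l in range(k)]
            mx = max(logp)
            p = [math.exp(x - mx) for x in logp]
            s = sum(p)
            q[n] = [x / s for x in p]
    labels = [max(range(k), key=lambda l: q[n][l]) for n in range(N)]
    return lam, theta, labels
```

Like any EM procedure, this sketch may converge to a local optimum that depends on the initialization, so in practice one would likely compare several seeds.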


For the greedy solution $S_k = \{v_1, \dots, v_k\}$, we detect the
set of k communities,

$$Z_k = \{Z_1, \dots, Z_k\},$$

by using the method of Newman and Leicht. For every $v_i$, we
define $\gamma(v_i)$ by the condition $Z_{\gamma(v_i)} \ni v_i$. In the same way
as for the SR-community structure, we quantify the strength of
the correlation between $S_k$ and $Z_k$ by F(k) (see (8)). Here,
we modify the definition of F(k) by changing each
$F_0(p; v_i)$ (see (7)) to

$$F_0(p; v_i) = 200 \, \frac{|H(v_i, p) \cap Z_{\gamma(v_i)}|}{|H(v_i, p)| + |Z_{\gamma(v_i)}|}.$$

C.  Experimental  Settings 

In our experiments, we assigned a uniform probability $\beta$ to
the propagation probability $\beta_{u,v}$ for any directed link (u, v)
of the network. As investigated by Leskovec et al. [7], it seems
that large cascades of information diffusion happen rarely.
Thus, we examined the IC model with relatively small $\beta$,
following Kempe et al. [5].

We estimated the greedy solution $S_k = \{v_1, \dots, v_k\}$ using
the method of Kimura et al. [6] with the parameter value
10,000. Here, the parameter represents the number of bond
percolation processes for estimating the influence degree
$\sigma(X)$ of a given initial active set X. Also, we estimated the
influence set $H(v_i, p)$ of node $v_i$ with probability p through
300,000 simulations of the IC model.


D.  Experimental  Results 

We  describe  the  results  for  the  experiments  using  the  blog 
network  dataset  and  the  Wikipedia  network  dataset. 


Fig.  1.  SR-covering  number  Mk  of  greedy  solution  Sk  on  the  blog  network 
dataset. 


Figures 1 and 2 plot the SR-covering number $M_k$ of the
greedy solution $S_k$ with respect to k on the blog network
dataset and the Wikipedia network dataset, respectively. For
almost all k, we observe that the larger the value of the
propagation probability $\beta$ is, the larger the SR-covering number
$M_k$ of $S_k$ is.


Fig.  2.  SR-covering  number  of  greedy  solution  Sk  on  the  Wikipedia 
network  dataset. 


Fig. 3. Strength F(k) of correlation with greedy solution $S_k$ on the blog
network dataset ($\beta$ = 5%).


Fig. 4. Strength F(k) of correlation with greedy solution $S_k$ on the blog
network dataset ($\beta$ = 10%).


Fig. 5. Strength F(k) of correlation with greedy solution $S_k$ on the
Wikipedia network dataset ($\beta$ = 1%).


Fig. 6. Strength F(k) of correlation with greedy solution $S_k$ on the
Wikipedia network dataset ($\beta$ = 5%).

Figures 3, 4, 5, and 6 plot the strength F(k) of correlation
with the greedy solution $S_k$ with respect to k (2 < k < 30).
In these figures, the circles indicate the strength
of the correlation between the greedy solution and the
SR-community structure (SR), and the squares indicate the
strength of the correlation between the greedy solution and
the community structure obtained by the method of Newman
and Leicht (NL). Figures 3 and 4 show the results for the
blog network dataset, and Figures 5 and 6 show the results
for the Wikipedia network dataset. These results imply that
for the IC model with relatively small propagation probability
$\beta$, the SR-community structure can be more strongly
correlated with the greedy solution than the community structure
introduced by Newman and Leicht.

VI.  Concluding  Remarks 

Aiming to obtain efficient methods for finding better
approximate solutions for the influence maximization problem
on a social network under the IC model, we have explored
which structural feature of the underlying network is correlated
with the greedy solution. We have focused on the
SR-community structure of the network, and analyzed the greedy
solution in terms of the SR-community structure. Using real
large social networks including a blog network, we have
experimentally demonstrated that, in comparison with the
community structure introduced by Newman and Leicht, the
SR-community structure can be more strongly correlated with the
greedy solution.

On the other hand, extensive verification of this proposition
with various real social networks remains an important
task. However, we have already made substantial progress,
and we are encouraged by our initial results.

Abstract

We address the problem of minimizing the propagation of
undesirable things, such as computer viruses or malicious
rumors, by blocking a limited number of links in a network, a
dual problem to the influence maximization problem of finding
the most influential nodes in a social network for
information diffusion. This minimization problem is another
approach to the problem of preventing the spread of
contamination by removing nodes in a network. We propose a method
for efficiently finding a good approximate solution to this
problem based on a natural greedy strategy. Using large real
networks, we demonstrate experimentally that the proposed
method significantly outperforms conventional link-removal
methods. We also show that unlike the strategy of removing
nodes, blocking links between nodes with high out-degrees is
not necessarily effective.

Introduction 

Considerable attention has recently been devoted to
investigating the structure and function of various networks
including computer networks, social networks and the World
Wide Web (Newman 2003). From a functional point of view,
networks can mediate diffusion of various things such as
innovation and topics. However, undesirable things can also
spread through networks. For example, computer viruses
can spread through computer networks and email networks,
and malicious rumors can spread through social networks
among individuals. Thus, developing effective strategies for
preventing the spread of undesirable things through a
network is an important research issue. Previous work studied
strategies for reducing the spread size by removing nodes
from a network. It has been shown in particular that the
strategy of removing nodes in decreasing order of out-degree
can often be effective (Albert, Jeong, and Barabási 2000;
Broder et al. 2000; Callaway et al. 2000; Newman, Forrest,
and Balthrop 2002). Notice here that removing a node
necessarily involves removing its links. In this sense, the task
of removing links is more fundamental than that of removing
nodes. Therefore, preventing the spread of undesirable
things by removing links from the underlying network is an
important problem.

In contrast, finding influential nodes that are effective
for the spread of information through a social network
is also an important research issue in terms of sociology
and "viral marketing" (Domingos and Richardson 2001;
Richardson and Domingos 2002; Gruhl et al. 2004). Thus,
researchers (Kempe, Kleinberg, and Tardos 2003; Kimura,
Saito, and Nakano 2007) have recently studied a
combinatorial optimization problem called the influence
maximization problem on a network under the independent cascade
(IC) model, a widely-used fundamental probabilistic model
of information diffusion. Here, the influence maximization
problem is the problem of extracting a set of k nodes to
target for initial activation such that it yields the largest
expected spread of information, where k is a given positive
integer. Note also that the IC model can be identified with
the so-called susceptible/infective/recovered (SIR) model for
the spread of disease in a network (Gruhl et al. 2004).

The problem we address in this paper is a dual problem to
the influence maximization problem. The problem is to
minimize the spread of undesirable things by blocking a limited
number of links in a network. More specifically, when some
undesirable thing starts from an arbitrary node and diffuses through
the network under the IC model, we consider finding a set
of k links such that the network obtained by blocking those
links minimizes the expected contamination area of the
undesirable thing, where k is a given positive integer. We refer
to this combinatorial optimization problem as the
contamination minimization problem. For this problem, we
propose a novel method for efficiently finding a good
approximate solution on the basis of a natural greedy strategy.
Using large real networks including a blog network, we
experimentally demonstrate that the proposed method
significantly outperforms link-removal heuristics that rely on the
well-studied notions of betweenness and out-degree. In
particular, we show that unlike the case of removing nodes,
blocking links between nodes with high out-degrees is not
necessarily effective for our problem.

Problem  Formulation 

In this paper, we address the problem of minimizing the
spread of undesirable things such as computer viruses and
malicious rumors in a network represented by a directed
graph G = (V, E). Here, V and E ($\subset V \times V$) are the sets
of all the nodes and links in the network, respectively. We
assume the IC model to be a mathematical model for the
diffusion process of some undesirable thing in the network,
and investigate the contamination minimization problem on
G. We call nodes active if they have been contaminated by
the undesirable thing.

Independent  Cascade  Model 

We  define  the  IC  model  on  graph  G  according  to  the  work 
of  Kempe,  Kleinberg,  and  Tardos  (2003). 

In the IC model, the diffusion process unfolds in discrete
time-steps $t \ge 0$, and it is assumed that nodes can switch
their states only from inactive to active, but not from active
to inactive. Given an initial active node v, we assume that
the node v has first become active at time-step 0, and all the
other nodes are inactive at time-step 0. We specify a real
value p with 0 < p < 1 in advance. Here, p is referred
to as the propagation probability through a link. The
diffusion process proceeds from the initial active node v in the
following way. When a node u first becomes active at
time-step t, it is given a single chance to activate each currently
inactive child node w, and succeeds with probability p. If u
succeeds, then w will become active at time-step t + 1. If
multiple parent nodes of w first become active at time-step
t, then their activation attempts are sequenced in an
arbitrary order, but all performed at time-step t. Whether or not
u succeeds, it cannot make any further attempts to activate
w in subsequent rounds. The process terminates if no more
activations are possible.

For an initial active node v, let $\sigma(v; G)$ denote the
expected number of active nodes at the end of the random
process of the IC model on G. We call $\sigma(v; G)$ the influence
degree of node v in graph G.
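The diffusion rule and the influence degree described above can be illustrated with a direct Monte Carlo simulation. This is a minimal sketch, not the estimation method used in the paper; the names `simulate_ic` and `influence_degree`, the adjacency-dictionary representation, and the default number of runs are assumptions made for the example.

```python
import random


def simulate_ic(children, v, p, rng):
    """Run one IC diffusion from initial active node v; return the set of active nodes.

    `children[u]` lists the nodes w for which a directed link (u, w) exists.
    """
    active = {v}
    frontier = [v]                  # nodes that first became active at step t
    while frontier:                 # stops when no more activations are possible
        next_frontier = []
        for u in frontier:
            for w in children.get(u, ()):
                # u gets a single chance on each currently inactive child w.
                # Marking w immediately only prevents redundant attempts by
                # other parents in the same step; the final set has the same
                # distribution.
                if w not in active and rng.random() < p:
                    active.add(w)   # w counts as active from step t + 1
                    next_frontier.append(w)
        frontier = next_frontier
    return active


def influence_degree(children, v, p, runs=1000, seed=0):
    """Monte Carlo estimate of the expected number of active nodes from v."""
    rng = random.Random(seed)
    return sum(len(simulate_ic(children, v, p, rng)) for _ in range(runs)) / runs
```

For example, `influence_degree({0: [1], 1: [2]}, 0, 0.5)` estimates the influence degree of node 0 on a three-node directed path.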

Contamination  Minimization  Problem 

Now, we give a mathematical definition of the contamination
minimization problem on graph G = (V, E). To prevent
the undesirable thing from spreading through the
network under the IC model, we aim to minimize the expected
contamination area (that is, the expected number of active
nodes) by appropriately removing a fixed number of links.

First, we define the contamination degree c(G) of graph
G as the average of influence degrees of all the nodes in G,
that is,

$$c(G) = \frac{1}{|V|} \sum_{v \in V} \sigma(v; G). \tag{1}$$

Here, |A| stands for the number of elements of a set A. For
any link $e \in E$, let G(e) denote the graph $(V, E \setminus \{e\})$.
We refer to G(e) as the graph constructed by blocking e in
G. Similarly, for any $D \subset E$, let G(D) denote the graph
$(V, E \setminus D)$. We refer to G(D) as the graph constructed by
blocking D in G. We define the contamination minimization
problem on graph G as follows: Given a positive integer k
with k < |E|, find a subset $D^*$ of E with $|D^*| = k$ such that
$c(G(D^*)) \le c(G(D))$ for any $D \subset E$ with |D| = k.

For a large network, any straightforward method for
exactly solving the contamination minimization problem
suffers from combinatorial explosion. Therefore, we consider
approximately solving the problem.


Proposed  Method 

We propose a method for efficiently finding a good
approximate solution to the contamination minimization problem
on graph G = (V, E). Let k be the number of links to be
blocked in this problem.

Greedy Algorithm

We approximately solve the contamination minimization
problem on G = (V, E) by the following greedy algorithm:

1. Set $D_0 \leftarrow \emptyset$.
2. Set $E_0 \leftarrow E$.
3. Set $G_0 \leftarrow G$.
4. for i = 0 to k − 1 do
5.   Choose a link $e^* \in E_i$ minimizing $c(G_i(e))$, $(e \in E_i)$.
6.   Set $D_{i+1} \leftarrow D_i \cup \{e^*\}$.
7.   Set $E_{i+1} \leftarrow E_i \setminus \{e^*\}$.
8.   Set $G_{i+1} \leftarrow (V, E_{i+1})$.
9. end for

Here, $D_k$ is the set of links blocked, and represents the
approximate solution obtained by this algorithm. $G_k$ is the
graph constructed by blocking $D_k$ in graph G, that is,
$G_k = G(D_k)$.
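The greedy loop above can be sketched as follows, with the contamination estimator passed in as a function. This is an illustrative skeleton only: the names `greedy_block_links` and `mean_reachable_contamination` are hypothetical, and the toy estimator (exact reachability, which corresponds to propagation probability 1) stands in for the estimation method developed in the following sections.

```python
def greedy_block_links(edges, k, contamination):
    """Greedily choose k links to block.

    `edges` is an iterable of directed links (u, v); `contamination(E)` is any
    caller-supplied estimator of c((V, E)) (a hypothetical interface).
    """
    remaining = set(edges)      # E_i
    blocked = []                # D_i, kept in selection order
    for _ in range(k):
        # Step 5: choose e* in E_i minimizing c(G_i(e)); sorting makes ties deterministic.
        best = min(sorted(remaining), key=lambda e: contamination(remaining - {e}))
        blocked.append(best)    # Step 6
        remaining.remove(best)  # Steps 7 and 8
    return blocked, remaining


def mean_reachable_contamination(nodes):
    """Toy estimator: average number of reachable nodes (the IC model with p = 1)."""
    def c(E):
        adj = {}
        for u, v in E:
            adj.setdefault(u, []).append(v)
        total = 0
        for s in nodes:
            seen, stack = {s}, [s]
            while stack:
                x = stack.pop()
                for y in adj.get(x, ()):
                    if y not in seen:
                        seen.add(y)
                        stack.append(y)
            total += len(seen)
        return total / len(nodes)
    return c
```

Written this way, each iteration evaluates the estimator once per remaining link, which is exactly the cost that the estimation method described next is designed to reduce.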

To implement this greedy algorithm, we need a method
for calculating $\{c(G_i(e));\ e \in E_i\}$ in Step 5 of the
algorithm. However, the IC model is a stochastic process model,
and it is an open question how to exactly calculate influence
degrees by an efficient method (Kempe, Kleinberg, and
Tardos 2003). Therefore, we develop a method for estimating
$\{c(G_i(e));\ e \in E_i\}$.

Kimura, Saito, and Nakano (2007) presented the bond
percolation method that efficiently estimates the influence
degrees $\{\sigma(v; G);\ v \in V\}$ for any directed graph G =
(V, E). Thus, we can estimate $c(G_i(e))$ for each $e \in E_i$
by straightforwardly applying the bond percolation method.
However, $|E_i|$ is very large for a large network unless
i is very large, so applying the method separately for every
link is costly. Therefore, we propose a method that can
estimate $\{c(G_i(e));\ e \in E_i\}$ in a more efficient manner on
the basis of the bond percolation method.

Bond  Percolation  Method 

First, we revisit the bond percolation method (Kimura, Saito,
and Nakano 2007). Here, we consider estimating the
influence degrees $\{\sigma(v; G_i);\ v \in V\}$ for the IC model with
propagation probability p in graph $G_i = (V, E_i)$.

It is known that the IC model is equivalent to the bond
percolation process that independently declares every link
of $G_i$ to be "occupied" with probability p (Newman 2003).
Let M be a sufficiently large positive integer. We perform
the bond percolation process M times, and sample a set of
M graphs constructed by the occupied links,

$$\{G_i^m = (V, E_i^m);\ m = 1, \dots, M\}.$$

Then, we can approximate the influence degree $\sigma(v; G_i)$ by

$$\sigma(v; G_i) \simeq \frac{1}{M} \sum_{m=1}^{M} |\mathcal{F}(v; G_i^m)|.$$

Here, for any directed graph G = (V, E), $\mathcal{F}(v; G)$ denotes
the set of all the nodes that are reachable from node v in the
graph. We say that node u is reachable from node v if there
is a path from v to u along the links in the graph. Let

$$V = \bigcup_{u \in U(G_i^m)} S(u; G_i^m)$$

be the strongly connected component (SCC) decomposition
of graph $G_i^m$, where $S(u; G_i^m)$ denotes the SCC of $G_i^m$
that contains node u, and $U(G_i^m)$ stands for a set of all the
representative nodes for the SCCs of $G_i^m$. The bond
percolation method performs the SCC decomposition of each
$G_i^m$, and estimates all the influence degrees $\{\sigma(v; G_i);\ v \in V\}$
in $G_i$ as follows:

$$\sigma(v; G_i) = \frac{1}{M} \sum_{m=1}^{M} |\mathcal{F}(u; G_i^m)|, \quad (v \in S(u; G_i^m)), \tag{2}$$

where $u \in U(G_i^m)$.
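The sampling step described above can be sketched in a few lines. For brevity this illustration computes the reachable set of every node by a separate graph search and therefore omits the SCC decomposition that the bond percolation method uses to share work among nodes in the same component; the function names and default sample count are assumptions made for the example.

```python
import random


def sample_occupied(edges, p, M, seed=0):
    """Perform the bond percolation process M times; each link is occupied w.p. p."""
    rng = random.Random(seed)
    return [[e for e in edges if rng.random() < p] for _ in range(M)]


def reachable(adj, v):
    """Return all nodes reachable from v along the links in adj (v included)."""
    seen, stack = {v}, [v]
    while stack:
        x = stack.pop()
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen


def influence_degrees(nodes, edges, p, M=1000, seed=0):
    """Estimate the influence degree of every node as the mean reachable-set size."""
    totals = {v: 0 for v in nodes}
    for occupied in sample_occupied(edges, p, M, seed):
        adj = {}
        for u, w in occupied:
            adj.setdefault(u, []).append(w)
        for v in nodes:
            totals[v] += len(reachable(adj, v))
    return {v: t / M for v, t in totals.items()}
```

Because all nodes in one SCC have the same reachable set, computing that set once per SCC representative, as in Equation (2), avoids most of the repeated searches performed here.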


Estimation  Method 

We are now in a position to give a method for efficiently
estimating $\{c(G_i(e));\ e \in E_i\}$ in Step 5 of the greedy
algorithm. We develop such an estimation method on the basis
of the bond percolation method.

For any directed graph G = (V, E), we define $\varphi(G)$ by

$$\varphi(G) = \frac{1}{|V|} \sum_{v \in V} |\mathcal{F}(v; G)|. \tag{3}$$

Using the bond percolation method, we consider estimating
the contamination degree $c(G_i)$ of the graph $G_i = (V, E_i)$.
Then, by Equations (1), (2) and (3), we can estimate $c(G_i)$
as

$$c(G_i) = \frac{1}{M} \sum_{m=1}^{M} \varphi(G_i^m). \tag{4}$$

Here, note that $\varphi(G_i^m)$ is calculated by

$$\varphi(G_i^m) = \frac{1}{|V|} \sum_{u \in U(G_i^m)} |\mathcal{F}(u; G_i^m)| \, |S(u; G_i^m)|. \tag{5}$$

We assume that M is sufficiently large. Then, by the
independence of the bond percolation process, we can estimate
$c(G_i(e))$ for every $e \in E_i$ as

$$c(G_i(e)) = \frac{1}{|M_i(e)|} \sum_{m \in M_i(e)} \varphi(G_i^m), \tag{6}$$

where $G_i^m = (V, E_i^m)$, and

$$M_i(e) = \{m \in \{1, \dots, M\};\ e \notin E_i^m\}.$$

We efficiently estimate $\{c(G_i(e));\ e \in E_i\}$ by Equations (5)
and (6) without applying the bond percolation method for
every $e \in E_i$. Namely, the proposed method can achieve a
large reduction in computational cost compared with
the conventional bond percolation method.
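The averaging in Equation (6) can be sketched as follows, assuming the M sampled link sets and their φ values have already been computed. The function name `estimate_blocked_contamination` and the input layout are hypothetical; the sketch uses the identity that the sum over samples not containing e equals the total sum minus the sum over samples containing e, so one pass over the samples serves every link.

```python
def estimate_blocked_contamination(samples, phis, links):
    """Estimate c(G_i(e)) for each link e from one shared set of M samples.

    samples[m] is the collection of occupied links in the m-th sample and
    phis[m] is the corresponding phi value. Every occupied link must appear in
    `links`. Links occupied in all samples are omitted from the result, since
    M_i(e) would be empty (the text assumes M is large enough to avoid this).
    """
    total, M = sum(phis), len(phis)
    occ_sum = {e: 0.0 for e in links}  # sum of phi over samples where e is occupied
    occ_cnt = {e: 0 for e in links}    # number of samples where e is occupied
    for occupied, phi in zip(samples, phis):
        for e in occupied:
            occ_sum[e] += phi
            occ_cnt[e] += 1
    return {e: (total - occ_sum[e]) / (M - occ_cnt[e])
            for e in links if occ_cnt[e] < M}
```

With propagation probability p, a link is absent from roughly a fraction 1 − p of the samples, so each estimate averages over about (1 − p)M graphs rather than M.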


Experimental  Evaluation 

Using  two  large  real  networks,  we  experimentally  evaluated 
the  performance  of  the  proposed  method. 

Network  Datasets 

First, we employed a trackback network of blogs because
a piece of information can propagate from one blog
author to another blog author through a trackback. Since
bloggers discuss various topics and establish mutual
communications by putting trackbacks on each other's blogs,
we regarded a link created by a trackback as a
bidirectional link for simplicity. By tracing up to ten steps
back in the trackbacks from the blog of the theme "JR
Fukuchiyama Line Derailment Collision" in the site "goo"
(http://blog.goo.ne.jp/usertheme/), we
collected a large connected trackback network in May 2005.
This network was a directed graph of 12,047 nodes and
79,920 links. We refer to this network data as the blog
network.

Next, we employed a network of people that was
derived from the "list of people" within Japanese Wikipedia.
Specifically, we extracted the maximal connected
component of the undirected graph obtained by linking two
people in the "list of people" if they co-occur in six or more
Wikipedia pages, and constructed a directed graph by
regarding those undirected links as bidirectional ones. We refer
to this network data as the Wikipedia network. Here, the
total numbers of nodes and directed links were 9,481 and
245,044, respectively.

Note  here  that  these  two  networks  are  strongly  connected. 


Experimental  Settings 

The IC model has the propagation probability p as a
parameter. So we determine the typical values of p for the blog
and Wikipedia networks, and use them in the experiments.
Let us consider the bond percolation process corresponding
to the IC model with propagation probability p in graph G
= (V, E). Let S be the expected fraction of the maximal
SCC in the network constructed by occupied links. S is a
function of p, and as the value of p decreases, the value of
S decreases. In other words, as the value of p decreases, the
original graph G gradually fragments into small clusters
under the corresponding bond percolation process. Figures 1
and 2 show the network fragmentation curves for the blog
and Wikipedia networks, respectively. Here, we estimated
S as follows:

$$S = \frac{1}{M\,|V|} \sum_{m=1}^{M} \max_{u \in U(G^m)} |S(u; G^m)|,$$

where M = 10000. We focus on the point $p^*$ at which the
average rate dS/dp of change of S attains the maximum, and
regard it as the typical value of p for the network. Note that
$p^*$ is a critical point of dS/dp, and defines one of the features
intrinsic to the network. From Figures 1 and 2, we estimated
$p^*$ to be $p^* = 0.2$ for the blog network and $p^* = 0.03$ for
the Wikipedia network.
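The fragmentation curve described above can be estimated with a short routine that samples occupied-link graphs and measures the largest SCC in each. The sketch below is an illustration under stated assumptions: it uses an iterative form of Kosaraju's algorithm for the SCC step (the paper does not say which SCC algorithm was used), and the function names and default sample count are hypothetical.

```python
import random


def largest_scc_size(nodes, edges):
    """Size of the largest strongly connected component (Kosaraju, iterative)."""
    fwd = {v: [] for v in nodes}
    rev = {v: [] for v in nodes}
    for u, w in edges:
        fwd[u].append(w)
        rev[w].append(u)
    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for s in nodes:
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(fwd[s]))]
        while stack:
            x, it = stack[-1]
            for y in it:
                if y not in seen:
                    seen.add(y)
                    stack.append((y, iter(fwd[y])))
                    break
            else:
                order.append(x)  # all children of x are finished
                stack.pop()
    # Pass 2: flood-fill the reversed graph in reverse completion order.
    best, assigned = 0, set()
    for s in reversed(order):
        if s in assigned:
            continue
        assigned.add(s)
        size, stack = 0, [s]
        while stack:
            x = stack.pop()
            size += 1
            for y in rev[x]:
                if y not in assigned:
                    assigned.add(y)
                    stack.append(y)
        best = max(best, size)
    return best


def fragmentation(nodes, edges, p, M=200, seed=0):
    """Estimate S: mean fraction of nodes in the largest SCC over M samples."""
    rng = random.Random(seed)
    total = 0
    for _ in range(M):
        occupied = [e for e in edges if rng.random() < p]
        total += largest_scc_size(nodes, occupied)
    return total / (M * len(nodes))
```

Evaluating `fragmentation` on a grid of p values and locating the steepest rise gives a rough numerical counterpart of the point p* read from the figures.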

For the proposed method, we need to specify the number
M of times the bond percolation process is performed. In the
experiments, we used M = 10000.


Figure 1: Fragmentation of the blog network for the IC
model. The fraction S of the maximal SCC as a function
of the propagation probability p.


Comparison  Methods 

We compared the proposed method with two heuristics
based on the well-studied notions of betweenness and
out-degree in the field of complex network theory, as well as the
crude baseline of blocking links at random. We refer to the
method of blocking links uniformly at random as the
random method.

The betweenness score $b_G(e)$ of a link e in a directed
graph G = (V, E) is defined as follows:

$$b_G(e) = \sum_{u, v \in V} \frac{n_G(e; u, v)}{N_G(u, v)},$$

where $N_G(u, v)$ denotes the number of the shortest paths
from node u to node v in G, and $n_G(e; u, v)$ denotes
the number of those paths that pass e. Here, we set
$n_G(e; u, v) / N_G(u, v) = 0$ if $N_G(u, v) = 0$. Newman and
Girvan (2004) successfully extracted community structure in
a network using the following link-removal algorithm based
on betweenness:

1. Calculate betweenness scores for all links in the network.

2.  Find  the  link  with  the  highest  score  and  remove  it  from 
the  network. 

3.  Recalculate  betweenness  scores  for  all  remaining  links. 

4.  Repeat  from  Step  2. 
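The link-removal steps above can be sketched for an unweighted directed graph as follows. The score computation uses a Brandes-style accumulation over breadth-first searches, which is one standard way to evaluate the betweenness definition given earlier; it is an assumption of this example rather than the procedure reported in the paper, and the function names are hypothetical. The graph is assumed to have no duplicate links.

```python
from collections import deque


def edge_betweenness(nodes, edges):
    """Betweenness score of every directed link, summed over all source nodes."""
    adj = {v: [] for v in nodes}
    for u, w in edges:
        adj[u].append(w)
    score = {e: 0.0 for e in edges}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths to each node.
        dist = {s: 0}
        npaths = {v: 0 for v in nodes}
        npaths[s] = 1
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([s])
        while queue:
            x = queue.popleft()
            order.append(x)
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
                if dist[y] == dist[x] + 1:
                    npaths[y] += npaths[x]
                    preds[y].append(x)
        # Back-propagate path fractions from the farthest nodes toward s.
        delta = {v: 0.0 for v in nodes}
        for y in reversed(order):
            for x in preds[y]:
                share = npaths[x] / npaths[y] * (1.0 + delta[y])
                score[(x, y)] += share
                delta[x] += share
    return score


def betweenness_method(nodes, edges, k):
    """Block k links, recalculating the scores after each removal (Steps 1 to 4)."""
    remaining, blocked = list(edges), []
    for _ in range(k):
        score = edge_betweenness(nodes, remaining)
        best = max(remaining, key=lambda e: score[e])
        blocked.append(best)
        remaining.remove(best)
    return blocked
```

Recalculating all scores after every removal is what makes this heuristic expensive on large networks, since each recalculation runs one search per node.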

In particular, the notion of betweenness can be interpreted in
terms of signals traveling through a network. If signals travel
from source nodes to destination nodes along the shortest
paths in a network, and all nodes send signals at the same
constant rate to all others, then the betweenness score of a
link is a measure of the rate at which signals pass along the
link. Thus, we naively expect that blocking the links with the
highest betweenness score can be effective for preventing
the spread of contamination in the network. Therefore, we
apply the method of Newman and Girvan (2004) to the
contamination minimization problem. We refer to this method
as the betweenness method.

Figure 2: Fragmentation of the Wikipedia network for the
IC model. The fraction S of the maximal SCC as a function
of the propagation probability p. The upper and lower frames
show the network fragmentation curves for the whole range
of p and the range of 0.01 < p < 0.09, respectively.

On the other hand, previous work has shown that simply
removing nodes in order of decreasing out-degrees works
well for preventing the spread of contamination in most real
networks (Albert, Jeong, and Barabási 2000; Broder et al.
2000; Callaway et al. 2000; Newman, Forrest, and Balthrop
2002). Here, the out-degree d(v) of a node v means the
number of outgoing links from the node v. Thus, blocking
links between nodes with high out-degrees looks promising
for the contamination minimization problem. Therefore, as
a comparison method, we employ the method of recursively
blocking links e = (u, v) from u to v in decreasing order of
their scores d(e),

$$d(e) = d(u)\,d(v).$$

We refer to this method as the out-degree method.
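A minimal sketch of this scoring rule is given below. It computes the scores once on the original graph and returns the k highest-scoring links; whether the authors recomputed out-degrees after each blocked link is not stated in the text, so the single-pass ranking is an assumption of this example, as is the function name `out_degree_method`.

```python
from collections import Counter


def out_degree_method(edges, k):
    """Return the k links (u, v) with the largest product d(u) * d(v).

    d(.) is the out-degree in the original graph; ties keep input order.
    """
    out_deg = Counter(u for u, _ in edges)  # missing nodes count as out-degree 0
    ranked = sorted(edges,
                    key=lambda e: out_deg[e[0]] * out_deg[e[1]],
                    reverse=True)
    return ranked[:k]
```

A variant that recomputes out-degrees after every removal would follow the same pattern inside a loop of k iterations.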

Experimental  Results 

We evaluated the performance of the proposed method and
compared it with that of the betweenness, out-degree and
random methods. Clearly, the performance of a method
for solving the contamination minimization problem can be
evaluated in terms of contamination degree c. We used the
value of c (see Equations (4) and (5)) that is estimated by the
bond percolation method with M = 10000.

Figures  3  and  4  show  the  contamination  degree  c  of  the 
resulting  network  as  a  function  of  the  number  k  of  links 
blocked  for  the  blog  network,  where  the  circles,  triangles, 
diamonds  and  squares  indicate  the  results  for  the  proposed, 
betweenness,  out-degree  and  random  methods,  respectively. 


Figure  3:  Performance  comparison  between  the  proposed 
and  betweenness  methods  in  the  blog  network  for  the  IC 
model  with  p  =  0.2. 


Figure  4:  Performance  comparison  of  the  proposed  method 
for  k  =  100  with  the  out-degree  and  random  methods  in  the 
blog  network  for  the  IC  model  with  p  =  0.2. 


In Figure 4, the dashed line indicates the contamination
degree of the network obtained by the proposed method for
k = 100. From Figures 3 and 4, we first see that the
proposed method outperformed the betweenness, out-degree
and random methods for the blog network. By taking into
account the definition (1) of contamination degree, we can
see from Figure 3 that the proposed method decreased
the expected number of nodes contaminated from about 980
nodes to about 580 nodes by blocking 100 appropriately chosen
links for the blog network. Note here that blocking 100 links
means blocking about 0.13% of the links in the blog
network. Thus, we find from Figure 3 that by appropriately
blocking about 0.13% of the links in the blog network, the
proposed and betweenness methods decreased contamination
degree by about 41% and 30%, respectively. Hence,
we can deduce that the proposed method was effective, and
also outperformed the betweenness method by over 10% at
k = 100 for the blog network. Moreover, we find from
Figure 4 that blocking 100 links by using the proposed method
had the same effect as blocking over 10000 links by using the
out-degree and random methods for the blog network.
Namely, we can deduce that the proposed method was 100
times more effective than the out-degree and random
methods at k = 100 for the blog network.
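As a quick check of the figures quoted for the blog network, the percentages follow from the link count of 79,920 and the approximate contamination degrees of 980 and 580 stated in the text:

```latex
\[
\frac{100}{79920} \approx 0.00125 \approx 0.13\%,
\qquad
\frac{980 - 580}{980} \approx 0.408 \approx 41\%,
\qquad
\frac{10000}{100} = 100 .
\]
```

The last ratio is the basis for the factor of 100 relative to the out-degree and random methods.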

Figures 5 and 6 display the contamination degree c of the
resulting network as a function of the number k of links
blocked for the Wikipedia network. Here, as in Figures 3
and 4, the circles, triangles, diamonds and squares indicate
the results for the proposed, betweenness, out-degree and
random methods, respectively. In Figure 6, the dashed line
indicates the contamination degree of the network obtained
by the proposed method for k = 300. We also see from
Figures 5 and 6 that the proposed method outperformed
the betweenness, out-degree and random methods for the
Wikipedia network. In particular, we observe from Figure 5
that as the value of k increased, the performance difference
between the proposed and betweenness methods gradually
increased. Note here that blocking 300 links means blocking
about 0.12% of the links in the Wikipedia network. Thus,
we find from Figure 5 that by appropriately blocking about
0.12% of the links in the Wikipedia network, the proposed
and betweenness methods decreased contamination degree
by about 26% and 16%, respectively. Hence, we can
deduce that the proposed method was effective, and also
outperformed the betweenness method by about 10% at k = 300
for the Wikipedia network. Moreover, we find from
Figure 6 that blocking 300 links by using the proposed method
had the same effect as blocking about 30000 links by using the
out-degree and random methods for the Wikipedia network.
Namely, we can deduce that the proposed method was
about 100 times as effective as the out-degree and random
methods at k = 300 for the Wikipedia network.

These results imply that the proposed method works
effectively as expected, and significantly outperforms the
conventional link-removal heuristics, that is, the betweenness,
out-degree and random methods. This shows that a
significantly better link-blocking strategy for reducing the spread
size of contamination can be obtained by explicitly
incorporating the diffusion dynamics of contamination in a
network, rather than relying solely on structural properties of
the graph.

We note from Figures 4 and 6 that the out-degree method
was almost the same as or worse than the random method
in performance. In the task of removing nodes from a
network, the out-degree heuristic has been effective since many
links can be blocked at the same time by removing nodes
with high out-degrees. However, we find that in the task of
blocking a limited number of links, the strategy of blocking
links between nodes with high out-degrees is not necessarily
effective.


Figure  5:  Performance  comparison  between  the  proposed 
and  betweenness  methods  in  the  Wikipedia  network  for  the 
IC  model  with  p  =  0.03. 

Conclusion 

In  an  attempt  to  minimize  the  spread  of  undesirable  things 
by  blocking  links  in  a  network,  we  have  considered  the  con¬ 
tamination  minimization  problem,  a  dual  problem  to  the  in¬ 
fluence  maximization  problem  for  social  networks.  This 
minimization  problem  is  another  approach  to  the  problem  of 
preventing  the  spread  of  contamination  by  removing  nodes 
in a network. We have proposed a novel method for efficiently finding a good
approximate solution to this problem on the basis of the greedy algorithm and
the bond percolation method. Using large-scale blog and Wikipedia networks,
we have experimentally demonstrated that the proposed method works
effectively, and also significantly outperforms the conventional link-removal
heuristics based on
the  betweenness  and  out-degree.  Moreover,  we  have  found 
that  unlike  the  task  of  removing  nodes,  the  strategy  of  block¬ 
ing  links  between  nodes  with  high  out-degrees  is  not  neces¬ 
sarily  effective  for  our  problem. 

Abstract. Effective visualization is vital for understanding a complex network,
in particular its dynamical aspects such as the information diffusion process.
Existing node embedding methods are all based solely on the network topology
and sometimes produce counter-intuitive visualizations. A new node embedding
method based on conditional probability is proposed that explicitly addresses
the diffusion process under either the IC or LT model, formulated as a
cross-entropy minimization problem, together with two label assignment
strategies that can be simultaneously adopted. Numerical experiments were
performed on two large real networks, one represented by a directed graph and
the other by an undirected graph. The results clearly demonstrate the advantage
of the proposed methods over the conventional spring model and topology-based
cross-entropy methods, especially for the case of directed networks.


1  Introduction 

Analysis  of  the  structure  and  function  of  complex  networks,  such  as  social,  computer 
and  biochemical  networks,  has  been  a  hot  research  subject  with  considerable  atten¬ 
tion  [10].  A  network  can  play  an  important  role  as  a  medium  for  the  spread  of  various 
information.  For  example,  innovation,  hot  topics  and  even  malicious  rumors  can  prop¬ 
agate  through  social  networks  among  individuals,  and  computer  viruses  can  diffuse 
through  email  networks.  Previous  work  addressed  the  problem  of  tracking  the  propa¬ 
gation  patterns  of  topics  through  network  spaces  [5, 1],  and  studied  effective  “vacci¬ 
nation”  strategies  for  preventing  the  spread  of  computer  viruses  through  networks  [11, 
2].  Widely-used  fundamental  probabilistic  models  of  information  diffusion  through  net¬ 
works  are  the  independent  cascade  (IC)  model  and  the  linear  threshold  (LT)  model  [8, 
5]. Researchers have recently investigated the problem of finding a limited number of
influential  nodes  that  are  effective  for  the  spread  of  information  through  a  network  un¬ 
der  these  models  [8, 9].  In  these  studies,  understanding  the  flow  of  information  through 
networks  is  an  important  research  issue. 

This paper focuses on the problem of visualizing the information diffusion
process, which is vital for understanding its characteristics over a complex
network. Existing node embedding methods such as the spring model method [7]
and the cross-entropy method [14] are based solely on the network topology.
They do not take into account how information diffuses across the network.
Thus, it often happens that the visualized information flow does not match our
intuitive understanding, e.g., abrupt information flow gaps, inconsistency
between the distance between nodes and the reachability of information,
irregular patterns of information spread, etc. This sometimes happens when
visualizing the diffusion process for a network represented by a directed graph.

Thus,  it  is  important  that  node  embedding  explicitly  reflects  the  diffusion  process 
to  produce  more  natural  visualization.  We  have  devised  a  new  node  embedding  method 
that  incorporates  conditional  probability  of  information  diffusion  between  two  nodes,  a 
target  source  node  where  the  information  is  initially  issued  and  a  non-target  influenced 
node  where  the  information  has  been  received  via  intermediate  nodes.  Our  postulation  is 
that  good  visualization  should  satisfy  the  two  conditions:  path  continuity,  i.e.  any  infor¬ 
mation  diffusion  path  is  continuous  and  path  separability,  i.e.  each  different  information 
diffusion  path  is  clearly  separated  from  each  other.  To  this  end,  the  above  node  embed¬ 
ding  is  coupled  with  two  label  assignment  strategies,  one  with  emphasis  on  influence  of 
initially  activated  nodes,  and  the  other  on  degree  of  information  reachability. 

Extensive  numerical  experiments  were  performed  on  two  large  real  networks,  one 
generated  from  a  large  connected  trackback  network  of  blog  data,  resulting  in  a  di¬ 
rected  graph  of  12,047  nodes  and  53,315  links,  and  the  other,  a  network  of  people, 
generated  from  a  list  of  people  within  a  Japanese  Wikipedia,  resulting  in  an  undirected 
graph of 9,481 nodes and 245,044 links. The results clearly indicate that the proposed
probabilistic  visualization  method  satisfies  the  above  two  conditions  and  demonstrate 
its  advantage  over  the  well-known  conventional  methods:  spring  model  and  topology- 
based  cross-entropy  methods,  especially  for  the  case  of  a  directed  network.  The  method 
appeals  well  to  our  intuitive  understanding  of  information  diffusion  process. 


2  Information  Diffusion  Models 

We mathematically model the spread of information through a directed network
$G = (V, E)$ under the IC or LT model, where $V$ and $E$ ($\subset V \times V$)
stand for the sets of all the nodes and links, respectively. We call nodes
active if they have been influenced with the information. In these models, the
diffusion process unfolds in discrete time-steps $t \ge 0$, and it is assumed
that nodes can switch their states only from inactive to active,
but  not  from  active  to  inactive.  Given  an  initial  set  S  of  active  nodes,  we  assume  that  the 
nodes  in  S  have  first  become  active  at  time-step  0,  and  all  the  other  nodes  are  inactive 
at  time-step  0. 

2.1  Independent  Cascade  Model 

We define the IC model. In this model, for each directed link $(u, v)$, we
specify a real value $\beta_{u,v}$ with $0 < \beta_{u,v} < 1$ in advance. Here
$\beta_{u,v}$ is referred to as the propagation probability through link
$(u, v)$. The diffusion process proceeds from a given initial active set $S$ in
the following way. When a node $u$ first becomes active at time-step $t$, it is
given a single chance to activate each currently inactive child node $v$, and
succeeds with probability $\beta_{u,v}$. If $u$ succeeds, then $v$ will become
active at time-step $t + 1$. If multiple parent nodes of $v$ first become active
at time-step $t$, then their activation attempts are sequenced in an arbitrary
order, but all performed at time-step $t$. Whether or not $u$ succeeds, it
cannot make any further attempts to activate $v$ in subsequent rounds. The
process terminates if no more activations are possible.

For an initial active set $S$, let $\varphi(S)$ denote the number of active
nodes at the end of the random process for the IC model. Note that
$\varphi(S)$ is a random variable. Let $\sigma(S)$ denote the expected value of
$\varphi(S)$. We call $\sigma(S)$ the influence degree of $S$.
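
The IC process and the influence degree can be illustrated with a minimal Monte Carlo sketch. It is not the estimator used in this paper (that is the bond percolation method of [9]); the function names, the dictionary-based graph representation, and the number of runs are assumptions made for illustration. A queue replaces explicit time-steps, which changes the order of activation attempts but not the distribution of the final active set.

```python
import random
from collections import deque


def simulate_ic(children, seeds, beta, rng):
    """Run one IC diffusion and return the set of nodes active at termination.

    children: dict node -> list of child nodes (directed links u -> v)
    seeds:    iterable of initially active nodes (the set S)
    beta:     dict (u, v) -> propagation probability of link (u, v)
    """
    active = set(seeds)
    frontier = deque(active)  # nodes whose single activation chance is pending
    while frontier:
        u = frontier.popleft()
        for v in children.get(u, []):
            # u gets exactly one chance to activate each inactive child v
            if v not in active and rng.random() < beta[(u, v)]:
                active.add(v)
                frontier.append(v)
    return active


def estimate_sigma(children, seeds, beta, runs=1000, seed=0):
    """Monte Carlo estimate of the influence degree: the mean final active-set size."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(children, seeds, beta, rng)) for _ in range(runs))
    return total / runs
```

The estimate converges slowly as the number of runs grows, which is one reason a more efficient estimator is attractive for large networks.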

2.2  Linear  Threshold  Model 

We define the LT model. In this model, for every node $v \in V$, we specify, in
advance, a weight $\omega_{u,v}$ ($> 0$) from its parent node $u$ such that

$$\sum_{u \in \Gamma(v)} \omega_{u,v} \le 1,$$

where $\Gamma(v) = \{u \in V;\ (u, v) \in E\}$. The diffusion process from a
given initial active set $S$ proceeds according to the following randomized
rule. First, for any node $v \in V$, a threshold $\theta_v$ is chosen uniformly
at random from the interval $[0, 1]$. At time-step $t$, an inactive node $v$ is
influenced by each of its active parent nodes, $u$, according to weight
$\omega_{u,v}$. If the total weight from active parent nodes of $v$ is at least
threshold $\theta_v$, that is,

$$\sum_{u \in \Gamma_t(v)} \omega_{u,v} \ge \theta_v,$$

then $v$ will become active at time-step $t + 1$. Here, $\Gamma_t(v)$ stands for
the set of all the parent nodes of $v$ that are active at time-step $t$. The
process terminates if no more activations are possible.

The LT model is also a probabilistic model associated with the uniform
distribution on $[0, 1]^{|V|}$. Similarly to the IC model, we define a random
variable $\varphi(S)$ and its expected value $\sigma(S)$ for the LT model.
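
A minimal sketch of one LT run follows, under illustrative assumptions: the graph is given as a dictionary of parent lists in which every node appears as a key, and the helper names are hypothetical. The weighting helper uses the uniform choice 1/|Γ(v)| that the experiments in Section 4.2 adopt.

```python
import random


def uniform_weights(parents):
    """Weight 1 / |Gamma(v)| on every link (u, v), so the weights into v sum to 1."""
    return {(u, v): 1.0 / len(ps) for v, ps in parents.items() for u in ps}


def simulate_lt(parents, weight, seeds, rng):
    """Run one LT diffusion and return the set of nodes active at termination.

    parents: dict node -> list of parent nodes (every node must be a key)
    weight:  dict (u, v) -> weight of link (u, v)
    """
    # One threshold per node, drawn once before the process starts.
    theta = {v: rng.random() for v in parents}
    active = set(seeds)
    while True:
        # Nodes whose active parents carry total weight of at least the threshold.
        newly = [
            v for v in parents
            if v not in active
            and sum(weight[(u, v)] for u in parents[v] if u in active) >= theta[v]
        ]
        if not newly:
            return active
        active.update(newly)  # all of them become active at the next time-step
```

Averaging the size of the returned set over many runs gives a Monte Carlo estimate of the influence degree, exactly as for the IC model.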

2.3  Influence  Maximization  Problem 

Let $K$ be a given positive integer with $K < |V|$. We consider the problem of
finding a set of $K$ nodes to target for initial activation such that it yields
the largest expected spread of information through network $G$ under the IC or
LT model. The problem is referred to as the influence maximization problem, and
mathematically defined as follows: Find a subset $S^*$ of $V$ with $|S^*| = K$
such that $\sigma(S^*) \ge \sigma(S)$ for every $S \subset V$ with $|S| = K$.

For a large network, any straightforward method for exactly solving the
influence maximization problem suffers from combinatorial explosion. Therefore,
we approximately solve this problem by the greedy algorithm. Here,
$U_K = \{u_1, \cdots, u_K\}$ is the set of $K$ nodes to target for initial
activation, and represents the approximate solution obtained by this algorithm.
We refer to $U_K$ as the greedy solution.

Using large collaboration networks, Kempe et al. [8] experimentally demonstrated
that the greedy algorithm significantly outperforms node-selection heuristics
that rely on the well-studied notions of degree centrality and distance
centrality in the sociology literature. Moreover, the quality of $U_K$ is
guaranteed:

$$\sigma(U_K) \ge \left(1 - \frac{1}{e}\right) \sigma(S^*_K),$$

where $S^*_K$ stands for the exact solution to this problem.
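
The greedy selection can be sketched as follows. This is an illustration under assumptions rather than the implementation used here: `sigma` stands for any caller-supplied estimator of the influence degree (hypothetical name), and ties are broken by the order of `nodes`.

```python
def greedy_targets(nodes, sigma, K):
    """Pick K nodes one at a time, each maximizing sigma of the enlarged set.

    nodes: list of candidate nodes
    sigma: callable taking a list of nodes and returning its (estimated)
           influence degree
    """
    chosen = []
    for _ in range(K):
        candidates = (v for v in nodes if v not in chosen)
        # Add the node whose inclusion gives the largest estimated spread.
        best = max(candidates, key=lambda v: sigma(chosen + [v]))
        chosen.append(best)
    return chosen
```

For example, with a toy coverage function in place of a diffusion model, `greedy_targets(["a", "b", "c"], sigma, 2)` first picks the node covering the most elements and then the node adding the most new ones. Each step evaluates `sigma` once per remaining candidate, so the cost of the estimator dominates the running time.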

To implement the greedy algorithm, we need a method for calculating
$\{\sigma(U_k \cup \{v\});\ v \in V \setminus U_k\}$ for $1 \le k \le K$.
However, it is an open question to exactly calculate influence degrees by an
efficient method for the IC or LT model [8]. Kimura et al. [9] presented the
bond percolation method that efficiently estimates influence degrees
$\{\sigma(U_k \cup \{v\});\ v \in V \setminus U_k\}$. Therefore, we estimate the
greedy solution $U_K$ using their method.
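
The idea of estimating all candidate influence degrees from shared percolation samples can be sketched for the IC model with a uniform propagation probability. This is a simplified illustration, not the method of [9]: it reuses each sampled set of occupied links for every candidate node but uses plain reachability searches, and all function names are hypothetical.

```python
import random


def sample_occupied(links, beta, rng):
    """Declare each directed link occupied independently with probability beta."""
    occupied = {}
    for u, v in links:
        if rng.random() < beta:
            occupied.setdefault(u, []).append(v)
    return occupied


def reachable(occupied, sources):
    """Nodes reachable from `sources` along occupied links (sources included)."""
    seen = set(sources)
    stack = list(seen)
    while stack:
        u = stack.pop()
        for v in occupied.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def estimate_marginals(links, nodes, current, beta, runs=1000, seed=0):
    """Estimate sigma(current + [v]) for every v not in current."""
    rng = random.Random(seed)
    totals = {v: 0 for v in nodes if v not in current}
    for _ in range(runs):
        occ = sample_occupied(links, beta, rng)
        base = reachable(occ, current)  # shared by all candidates in this sample
        for v in totals:
            # Reachable set of a union is the union of the reachable sets.
            totals[v] += len(base | reachable(occ, [v]))
    return {v: t / runs for v, t in totals.items()}
```

Sharing one sample across all candidates removes the sampling noise between candidates that independent simulations would introduce when they are compared.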

3  Visualization  Method 

We  especially  focus  on  visualizing  the  information  diffusion  process  from  the  target 
nodes  selected  to  be  a  solution  of  the  influence  maximization  problem.  To  this  end, 
we  propose  a  visualization  method  that  has  the  following  characteristics:  1)  utilizing 
the  target  nodes  as  a  set  of  pivot  objects  for  visualization,  2)  applying  a  probabilistic 
algorithm for embedding all the nodes in the networks into a Euclidean space, and
3)  varying  appearance  of  the  embedded  nodes  on  the  basis  of  two  label  assignment 
strategies.  In  what  follows,  we  describe  some  details  of  the  probabilistic  embedding 
algorithm  and  the  label  assignment  strategies. 

3.1  Probabilistic  Embedding  Algorithm 

Let $U_K = \{u_k : 1 \le k \le K\} \subset V$ be a set of target nodes, which
maximizes the expected number of influenced nodes in the network based on an
information diffusion model such as IC or LT. Let $v_n \notin U_K$ be a
non-target node in the network; then we can consider the conditional
probability $p_{k,n} = p(v_n \mid u_k)$ that a node $v_n$ is influenced when one
target node $u_k$ alone is set as an initial information source. Here note that
we can regard $p_{k,n}$ as a binomial probability with respect to a pair of
nodes $u_k$ and $v_n$. In our visualization approach, we attempt to produce an
embedding of the nodes so as to preserve the relationships expressed as the
conditional probabilities for all pairs of target and non-target nodes in the
network. We refer to this visualization strategy as the conditional probability
embedding (CE) algorithm.

Objective Function Let $\{x_k : 1 \le k \le K\}$ and $\{y_n : 1 \le n \le N\}$
be the embedding positions of the corresponding $K$ target nodes and
$N = |V| - K$ non-target nodes in an $M$-dimensional Euclidean space. Hereafter,
the $x_k$ and $y_n$ are called target and non-target vectors, respectively. As
usual, we define the (squared) Euclidean distance between $x_k$ and $y_n$ as
follows:

$$d_{k,n} = \|x_k - y_n\|^2 = \sum_{m=1}^{M} (x_{k,m} - y_{n,m})^2.$$

Here, we introduce a monotonic decreasing function $\rho(s) \in [0, 1]$ with
respect to $s \ge 0$, where $\rho(0) = 1$ and $\rho(\infty) = 0$.

Since $\rho(d_{k,n})$ can also be regarded as a binomial probability with
respect to $x_k$ and $y_n$, we can introduce a cross-entropy (cost) function
between $p_{k,n}$ and $\rho(d_{k,n})$ as follows:

$$\mathcal{E}_{k,n} = -p_{k,n} \log \rho(d_{k,n}) - (1 - p_{k,n}) \log(1 - \rho(d_{k,n})).$$

Since $\mathcal{E}_{k,n}$ is minimized when $\rho(d_{k,n}) = p_{k,n}$, this
minimization with respect to $x_k$ and $y_n$ is consistent with our problem
setting. In this paper, we employ a function of the form

$$\rho(s) = \exp\left(-\frac{s}{2}\right)$$

as the monotonic decreasing function, but note that our approach is not
restricted to this form. Then, the total cost function (objective function) can
be defined as follows:

$$\mathcal{E} = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} p_{k,n}\, d_{k,n} - \sum_{n=1}^{N} \sum_{k=1}^{K} (1 - p_{k,n}) \log(1 - \rho(d_{k,n})). \qquad (1)$$

Namely, our approach is formalized as a minimization problem of the objective
function defined in (1) with respect to $\{x_k : 1 \le k \le K\}$ and
$\{y_n : 1 \le n \le N\}$.
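
As a check on (1), the following short derivation shows where the factor 1/2 in the first term comes from and why each pairwise cost is stationary when the embedded probability matches the conditional probability. It is a sketch that assumes the exponential form of the decreasing function chosen above.

```latex
\begin{align*}
-p_{k,n}\log\rho(d_{k,n})
  &= -p_{k,n}\log\exp\!\left(-\frac{d_{k,n}}{2}\right)
   = \frac{1}{2}\,p_{k,n}\,d_{k,n}, \\
\frac{\partial \mathcal{E}_{k,n}}{\partial d_{k,n}}
  &= \frac{p_{k,n}}{2}
   - \frac{1 - p_{k,n}}{2}\cdot\frac{\rho(d_{k,n})}{1-\rho(d_{k,n})}
   = \frac{1}{2}\cdot\frac{p_{k,n}-\rho(d_{k,n})}{1-\rho(d_{k,n})}.
\end{align*}
```

The derivative vanishes exactly when the embedded probability equals $p_{k,n}$, is negative for smaller distances, and is positive for larger ones, so that point is the minimizer of the pairwise cost.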


Learning  Algorithm  As  the  basic  structure  of  our  learning  algorithms,  we  adopt  a 
coordinate  strategy  just  like  the  EM  (Expectation-Maximization)  algorithm.  First,  we 
adjust  the  target  vectors,  so  as  to  minimize  the  objective  function  by  freezing  the  non¬ 
target  vectors,  and  then,  we  adjust  the  non-target  vectors  by  freezing  the  target  vectors. 
These  two  steps  are  repeated  until  convergence  is  obtained. 

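In the former minimization step for the CE algorithm, we need to calculate the
derivative of the objective function with respect to $x_k$ as follows:

$$\mathcal{E}_{x_k} = \frac{\partial \mathcal{E}}{\partial x_k} = \sum_{n=1}^{N} \frac{p_{k,n} - \rho(d_{k,n})}{1 - \rho(d_{k,n})}\, (x_k - y_n). \qquad (2)$$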

Since $x_{k'}$ ($k' \ne k$) does not appear in (2), we can update $x_k$ without considering the other
target  vectors.  In  the  latter  minimization  step  for  the  CE  algorithm,  we  need  to  calculate 
the  following  derivative. 


$$\mathcal{E}_{y_n} = \frac{\partial \mathcal{E}}{\partial y_n} = \sum_{k=1}^{K} \frac{p_{k,n} - \rho(d_{k,n})}{1 - \rho(d_{k,n})}\, (y_n - x_k).$$

In  this  case,  we  update  y„  by  freezing  the  other  non-target  vectors.  Overall,  our  algo¬ 
rithm  can  be  summarized  as  follows: 

1. Initialize vectors $x_1, \cdots, x_K$ and $y_1, \cdots, y_N$.

2. Calculate gradient vectors $\mathcal{E}_{x_1}, \cdots, \mathcal{E}_{x_K}$.

3. Update target vectors $x_1, \cdots, x_K$.

4. Calculate gradient vectors $\mathcal{E}_{y_1}, \cdots, \mathcal{E}_{y_N}$.

5. Update non-target vectors $y_1, \cdots, y_N$.

6. Stop if $\max_{k,n}\{\|\mathcal{E}_{x_k}\|, \|\mathcal{E}_{y_n}\|\} < \varepsilon$.

7. Return to 2.

Here, a small positive value $\varepsilon$ controls the termination condition.
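
The alternating procedure can be sketched in plain Python. The text does not specify how the vectors are updated from their gradients, so the update rule here (a gradient step whose length is halved until the objective does not increase) is an assumption, as are the function names, the step size, the iteration cap, and the small floor on distances that keeps the logarithm finite.

```python
import math


def rho(s):
    """Decreasing map from squared distance to a probability: exp(-s / 2)."""
    return math.exp(-s / 2.0)


def sqdist(a, b):
    """Squared Euclidean distance, floored so that 1 - rho(d) stays positive."""
    return max(sum((ai - bi) ** 2 for ai, bi in zip(a, b)), 1e-12)


def objective(X, Y, P):
    """Objective (1) for target vectors X, non-target vectors Y, P[k][n] = p_{k,n}."""
    total = 0.0
    for k, xk in enumerate(X):
        for n, yn in enumerate(Y):
            d = sqdist(xk, yn)
            total += 0.5 * P[k][n] * d - (1.0 - P[k][n]) * math.log(1.0 - rho(d))
    return total


def gradient(point, others, probs):
    """Sum over `others` of (p - rho(d)) / (1 - rho(d)) * (point - other)."""
    g = [0.0] * len(point)
    for other, p in zip(others, probs):
        d = sqdist(point, other)
        c = (p - rho(d)) / (1.0 - rho(d))
        for m in range(len(point)):
            g[m] += c * (point[m] - other[m])
    return g


def descend(V, G, cost, step):
    """Assumed update rule: step against G, halving until the cost does not rise."""
    base = cost(V)
    while step > 1e-10:
        trial = [[v - step * g for v, g in zip(vec, gvec)] for vec, gvec in zip(V, G)]
        if cost(trial) <= base:
            return trial
        step /= 2.0
    return V  # no improving step found; leave the vectors unchanged


def ce_embed(P, X, Y, step=0.1, eps=1e-3, max_iter=200):
    """Alternate updates of X (Y frozen) and Y (X frozen), following steps 2 to 7."""
    K, N = len(X), len(Y)
    for _ in range(max_iter):
        gx = [gradient(X[k], Y, P[k]) for k in range(K)]
        X = descend(X, gx, lambda Z: objective(Z, Y, P), step)
        gy = [gradient(Y[n], X, [P[k][n] for k in range(K)]) for n in range(N)]
        Y = descend(Y, gy, lambda Z: objective(X, Z, P), step)
        # Step 6: stop once every gradient vector is shorter than eps.
        if max(math.sqrt(sum(c * c for c in g)) for g in gx + gy) < eps:
            break
    return X, Y
```

Because each target gradient involves only the non-target vectors, the K target updates are independent of one another, and likewise for the N non-target updates; one sweep therefore costs on the order of NK distance evaluations.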

3.2  Label  Assignment  Strategies 

In  an  attempt  to  effectively  understand  information  diffusion  process,  we  propose  two 
label  assignment  strategies,  on  which  the  appearance  of  the  embedded  target  and  non¬ 
target  nodes  depends.  The  first  strategy  assigns  labels  to  non-target  nodes  according  to 
the  standard  Bayes  decision  rule. 

$$l_1(v_n) = \arg\max_{1 \le k \le K} \{p_{k,n}\}.$$

It is obvious that this decision naturally reflects the influence of the target
nodes. Note that the target node identification number $k$ corresponds to the
order determined by the greedy method, i.e., $l_1(u_k) = k$.

In the second strategy, we introduce the following probability quantization by
noting $0 \le \max_{1 \le k \le K}\{p_{k,n}\} \le 1$,

$$l_2(v_n) = \left[ -\log_b \max_{1 \le k \le K} \{p_{k,n}\} \right] + 1,$$

where $[x]$ returns the greatest integer not greater than $x$, and $b$ stands
for the base of the logarithm. To each node belonging to
$Z = \{v_n : \max_{1 \le k \le K}\{p_{k,n}\} = 0\}$, we assign as the label the
maximum number determined by the nodes not belonging to $Z$. We believe that
this quantization reasonably reflects the degree of information reachability.
Here note that $l_2(u_k) = 1$ because it always becomes active at time-step
$t = 0$. These labels are further mapped to color scales according to some
monotonic mapping functions.
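
Both labeling rules are short enough to sketch directly. The function names, the matrix layout `P[k][n]`, and the default base `b = 2` are assumptions for illustration; ties in the first rule are broken in favor of the smaller target index.

```python
import math


def label_by_target(P, n):
    """First strategy: 1-based index of the target with the largest p_{k,n}."""
    column = [P[k][n] for k in range(len(P))]
    return column.index(max(column)) + 1


def label_by_reachability(P, b=2.0):
    """Second strategy: floor(-log_b max_k p_{k,n}) + 1 for every non-target node.

    Nodes whose maximum probability is 0 receive the largest label found
    among the remaining nodes.
    """
    K, N = len(P), len(P[0])
    best = [max(P[k][n] for k in range(K)) for n in range(N)]
    labels = [
        math.floor(-math.log(p, b)) + 1 if p > 0 else None
        for p in best
    ]
    top = max((lab for lab in labels if lab is not None), default=1)
    return [top if lab is None else lab for lab in labels]
```

A larger base gives coarser bands of reachability: each label covers the probabilities between two consecutive powers of 1/b.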


4  Experimental  Evaluation 

4.1  Network  Data 

In  our  experiments,  we  employed  two  sets  of  real  networks  used  in  [9],  which  exhibit 
many  of  the  key  features  of  social  networks.  We  describe  the  details  of  these  network 
data. 

The  first  one  is  a  trackback  network  of  blogs.  Blogs  are  personal  on-line  diaries 
managed  by  easy-to-use  software  packages,  and  have  rapidly  spread  through  the  World 
Wide Web [5]. Bloggers (i.e., blog authors) discuss various topics by using trackbacks.
Thus,  a  piece  of  information  can  propagate  from  one  blogger  to  another  blogger  through 
a  trackback.  We  exploited  the  blog  “Theme  salon  of  blogs”  in  the  site  “goo”  2,  where  a 
blogger  can  recruit  trackbacks  of  other  bloggers  by  registering  an  interesting  theme. 
By tracing up to ten steps back in the trackbacks from the blog of the theme
“JR Fukuchiyama Line Derailment Collision”, we collected a large connected
trackback network in May, 2005. The resulting network had 12,047 nodes and
53,315 directed links, and features the so-called “power-law” distributions for
the out-degree and in-degree that most real large networks exhibit. We refer to
this network data as the blog network.

2 http://blog.goo.ne.jp/usertheme/

The  second  one  is  a  network  of  people  that  was  derived  from  the  “list  of  people” 
within  Japanese  Wikipedia.  Specifically,  we  extracted  the  maximal  connected  compo¬ 
nent  of  the  undirected  graph  obtained  by  linking  two  people  in  the  “list  of  people”  if 
they  co-occur  in  six  or  more  Wikipedia  pages.  The  undirected  graph  is  represented  by 
an  equivalent  directed  graph  by  regarding  undirected  links  as  bidirectional  ones3.  The 
resulting network had 9,481 nodes and 245,044 directed links. We refer to this network
data  as  the  Wikipedia  network. 

Newman  and  Park  [12]  observed  that  social  networks  represented  as  undirected 
graphs  generally  have  the  following  two  statistical  properties  that  are  different  from 
non-social  networks.  First,  they  show  positive  correlations  between  the  degrees  of  ad¬ 
jacent  nodes.  Second,  they  have  much  higher  values  of  the  clustering  coefficient  C  than 
the  corresponding  configuration  models  (i.e.,  random  network  models).  For  the  undi¬ 
rected  graph  of  the  Wikipedia  network,  the  value  of  C  of  the  corresponding  configura¬ 
tion  model  was  0.046,  while  the  actual  measured  value  of  C  was  0.39,  and  the  degrees 
of  adjacent  nodes  were  positively  correlated.  Therefore,  the  Wikipedia  network  has  the 
key  features  of  social  networks. 


4.2  Experimental  Settings 

In the IC model, we assigned a uniform probability $\beta$ to the propagation
probability $\beta_{u,v}$ for any directed link $(u, v)$ of a network, that is,
$\beta_{u,v} = \beta$. We first determine the typical value of $\beta$ for the
blog network, and use it in the experiments. It is known that the IC model is
equivalent to the bond percolation process that independently declares every
link of the network to be “occupied” with probability $\beta$ [10]. Let $J$
denote the expected fraction of the maximal strongly connected component (SCC)
in the network constructed by the occupied links. Note that $J$ is an increasing
function of $\beta$. We focus on the point $\beta_*$ at which the average rate
of change of $J$, $dJ/d\beta$, attains the maximum, and regard it as the typical
value of $\beta$ for the network. Note that $\beta_*$ is a critical point of
$dJ/d\beta$, and defines one of the features intrinsic to the network. Figure 1
plots $J$ as a function of $\beta$. Here, we estimated $J$ using the bond
percolation method with the same parameter value as below [9]. From this figure
we experimentally estimated $\beta_*$ to be 0.2 for the blog network. In the
same way, we experimentally estimated $\beta_*$ to be 0.05 for the Wikipedia
network.
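
Locating the point of steepest increase on a sampled curve can be sketched with finite differences. This is one plausible reading of the procedure, not the authors' code: the function name is hypothetical, the curve is assumed to be given on an increasing grid of propagation probabilities, and the midpoint of the steepest interval is returned.

```python
def typical_beta(betas, J):
    """Midpoint of the grid interval on which J rises most steeply.

    betas: strictly increasing list of propagation probabilities
    J:     estimated fraction of the maximal SCC at each value in betas
    """
    best_slope, best_beta = float("-inf"), None
    for i in range(len(betas) - 1):
        # Average rate of change of J over the interval [betas[i], betas[i + 1]].
        slope = (J[i + 1] - J[i]) / (betas[i + 1] - betas[i])
        if slope > best_slope:
            best_slope = slope
            best_beta = 0.5 * (betas[i] + betas[i + 1])
    return best_beta
```

The resolution of the estimate is limited by the grid spacing, and noise in the estimated curve can move the maximum, so in practice the curve would be estimated with enough percolation samples to be smooth.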

In the LT model, we uniformly set weights as follows. For any node $v$ of a
network, the weight $\omega_{u,v}$ from a parent node $u \in \Gamma(v)$ is given
by $\omega_{u,v} = 1/|\Gamma(v)|$.

Once these parameters were set, we estimated the greedy solution
$U_K = \{u_1, \cdots, u_K\}$ of targets and the conditional probabilities
$\{p_{k,n};\ 1 \le k \le K,\ 1 \le n \le N\}$ using the bond percolation method
with the parameter value 10,000 [9]. Here, the parameter represents the number
of bond percolation processes for estimating the influence degree $\sigma(S)$ of
a given initial active set $S$.

3 For simplicity, we call a graph with bi-directional links an undirected graph.

Fig. 1: The fraction $J$ of the maximal SCC as a function of the propagation probability $\beta$


4.3  Brief  Description  of  Other  Visualization  Methods  used  for  Comparison 

We  have  compared  the  proposed  method  with  the  two  well  known  methods:  spring 
model  method  [7]  and  standard  cross-entropy  method  [14]. 

The spring model method assumes that there is a hypothetical spring between each
connected node pair and locates nodes such that the distance of each node pair
is closest to its minimum path length at equilibrium. Mathematically it is
formulated as minimizing (3).

$$\mathcal{K}(x) = \sum_{u=1}^{|V|-1} \sum_{v=u+1}^{|V|} \alpha_{u,v} \left(g_{u,v} - \|x_u - x_v\|\right)^2, \qquad (3)$$

where $g_{u,v}$ is the minimum path length between node $u$ and node $v$, and
$\alpha_{u,v}$ is a spring constant which is normally set to $1/(2 g_{u,v}^2)$.
The standard cross-entropy method first defines a similarity
$\rho(\|x_u - x_v\|^2) = \exp(-\|x_u - x_v\|^2/2)$ between the embedding
coordinates $x_u$ and $x_v$ and uses the corresponding element $a_{u,v}$ of the
adjacency matrix as a measure of distance between the node pair, and tries to
minimize the total cross entropy between these two. Mathematically it is
formulated as minimizing (4).

$$\mathcal{C}(x) = \sum_{u=1}^{|V|-1} \sum_{v=u+1}^{|V|} \left\{ -a_{u,v} \log \rho(\|x_u - x_v\|^2) - (1 - a_{u,v}) \log\left(1 - \rho(\|x_u - x_v\|^2)\right) \right\}. \qquad (4)$$

Here, note that we used the same function $\rho$ as before.

As is clear from the above formulation, both methods are completely based on
graph topology. They are both non-linear optimization problems and are easily
solved by a standard coordinate descent method. Here note that the applicability
of the spring model method and cross-entropy method is basically limited to
undirected networks. Thus, in order to obtain the embedding results by using
these methods, we neglected the direction in the directed blog network and
regarded it as an undirected one.
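
The two topology-based costs can be evaluated with a few lines of Python. This sketch only computes the objective values for a given layout and does not perform the minimization; the function names and the list-of-lists inputs are assumptions, the graph is assumed connected so that every path length is finite, and `math.dist` requires Python 3.8 or later.

```python
import math


def spring_cost(X, g):
    """Spring model cost: sum over u < v of (g_uv - ||x_u - x_v||)^2 / (2 g_uv^2).

    X: list of coordinate lists; g: matrix of minimum path lengths.
    """
    total = 0.0
    for u in range(len(X)):
        for v in range(u + 1, len(X)):
            dist = math.dist(X[u], X[v])
            total += (g[u][v] - dist) ** 2 / (2.0 * g[u][v] ** 2)
    return total


def cross_entropy_cost(X, a):
    """Topology-based cross-entropy cost with similarity exp(-s / 2).

    X: list of coordinate lists; a: 0/1 adjacency matrix.
    """
    total = 0.0
    for u in range(len(X)):
        for v in range(u + 1, len(X)):
            s = max(math.dist(X[u], X[v]) ** 2, 1e-12)  # avoid log(0)
            # -log(exp(-s / 2)) equals s / 2, so the first term needs no log.
            total += a[u][v] * s / 2.0
            total -= (1 - a[u][v]) * math.log(1.0 - math.exp(-s / 2.0))
    return total
```

Neither function uses link directions or diffusion probabilities, which is the sense in which both methods depend on graph topology alone.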


4.4  Experimental  Results 

The two label assignment strategies are independent of each other. They can be
used either separately or simultaneously. Here, we applied a color mapping to
both, and thus used them separately. The visualization results are shown in four
figures, each with six network figures. In each of these four figures, the left
three show the results of the first visualization strategy (method 1) and the
right three the results of the second visualization strategy (method 2), and the
top two show the results of the proposed method (CE algorithm), the middle two
the results of the spring model, and the bottom two the results of the
topology-based cross-entropy method. The first two figures (Figs. 2 and 3)
correspond to the results for the blog network and the last two (Figs. 4 and 5)
to the results for the Wikipedia network. For each, the results of the IC model
come first, followed by the results of the LT model.

The  most  influential  top  ten  nodes  are  chosen  as  the  target  nodes,  and  the  rest  are  all 
non-target  nodes.  In  the  first  visualization  strategy,  the  color  of  a  non-target  node  indi¬ 
cates  which  target  node  is  most  influential  to  the  node,  whereas  in  the  second  visualiza¬ 
tion  strategy,  it  indicates  how  easily  the  information  diffuses  from  the  most  influential 
target  node  to  reach  the  node.  Note  that  a  non-target  node  is  influenced  by  multiple  tar¬ 
get  nodes  probabilistically,  but  here  the  target  with  the  highest  conditional  probability 
is  chosen.  Thus,  the  most  influential  target  node  is  determined  for  each  non-target  node. 

Observation  of  the  results  of  the  proposed  method  (Figs.  2a,  2b,  3a,  3b,  4a,  4b,  5a, 
and  5b)  indicates  that  the  proposed  method  has  the  following  desirable  characteristics: 
1)  the  target  nodes  tend  to  be  allocated  separately  from  each  other,  and  from  each  tar¬ 
get  node,  2)  the  non-target  nodes  that  are  most  affected  by  the  same  target  node  are 
laid  out  forming  a  band  and  3)  the  reachability  changes  continuously  from  the  highest 
at  the  target  node  to  the  lowest  at  the  other  end  of  the  band.  From  this  observation,  it 
is confirmed that the two conditions we postulated are satisfied for both diffusion
models.  Observation  2)  above,  however,  needs  further  clarification.  Note  that  our  vi¬ 
sualization  does  not  necessarily  cause  the  information  diffusion  to  neighboring  nodes 
to  be  in  the  form  of  a  line  in  the  embedded  space.  For  example,  if  there  is  only  one 
source  (K=l),  the  information  would  diffuse  concentrically.  A  node  in  general  receives 
information  from  multiple  sources.  The  fact  that  the  embedding  result  forms  a  line,  on 
the  contrary,  reveals  an  important  characteristic  that  little  information  is  coming  from 
the  other  sources  for  the  networks  we  analyzed. 

In  the  proposed  method,  non-target  nodes  that  are  readily  influenced  are  easily  iden¬ 
tified,  whereas  those  that  are  rarely  influenced  are  placed  together.  Overlapping  of  the 
color well explains the relationship between each target and a non-target node.
For example, in Figures 3a and 3b it is easily observed that the effect of the
target nodes 5, 2 on non-target nodes interferes with the three bands that are
spread from the target nodes 8, 3, 10, and non-target nodes overlap as they move
away from the target nodes, demonstrating that a simple two-dimensional
visualization facilitates understanding of how different node groups overlap and
how the information flows from different target nodes interfere with each other.
The same observation applies to the target nodes 6, 1, 9, 7. On the contrary,
the target node 4 has its own effect separately. A similar argument is possible
for the relationship among target nodes. For example, in Figure 2a target nodes
4, 5, 6, 8 are located in relatively near positions compared with the other
target nodes.
It  is  crucial  to  abstract  and  visualize  the  essence  of  information  diffusion  by  deleting 
the  unnecessary  details  (node  to  node  diffusion).  A  good  explanation  for  the  overlap 
like  the  above  is  not  possible  by  other  visualization  methods.  Further,  the  visualization 
results  of  both  IC  and  LT  models  are  spatially  well  balanced.  In  addition,  there  are  no 
significant  differences  on  the  results  of  visualization  between  the  directed  network  and 
undirected  network.  Both  are  equally  good. 

Observation  of  the  results  of  the  spring  model  (Figs.  2c,  2d,  3c,  3d,  4c,  4d,  5c,  and 
5d)  and  the  topology-based  cross  entropy  method  (Figs.  2e,  2f,  3e,  3f,  4e,  4f,  5e,  and  5f) 
reveals the following. The clear difference between these and the proposed method is that it
is  not  that  easy  to  locate  the  target  nodes.  This  is  true,  in  particular,  for  the  spring  model. 
It  is  slightly  easier  for  the  standard  cross-entropy  method  because  the  target  nodes  are 
placed  in  the  cluster  centers,  but  clusters  often  overlap,  which  makes  visualization  less 
understandable.  It  is  also  noted  that  those  nodes  with  high  reachability,  i.e.,  nodes  with 
red,  which  should  be  placed  separately  due  to  the  influence  of  different  target  nodes 
are mixed together. Further, unlike the proposed method, there is a clear difference
between  the  IC  model  and  the  LT  model.  In  the  IC  model,  we  can  easily  recognize  non¬ 
target  nodes  with  high  reachability,  which  cover  a  large  portion  of  the  network,  whereas 
in  the  LT  model,  such  nodes  covering  only  a  small  portion  are  almost  invisible  in  the 
network.  In  contrast,  we  can  easily  pick  up  such  non-target  nodes  with  high  reachability 
even  for  the  LT  model  in  the  proposed  method. 

We observe that the standard cross-entropy method is in general better than the
spring model method in terms of the clarity of separability. The standard cross-entropy
method does better for the IC model than for the LT model, and is comparable to the
proposed method in terms of the clarity of reachability. However, the results of the
standard cross-entropy method (e.g., Fig. 2f) are unintuitive: the high-reachability
non-target nodes are placed away from the target nodes, and some target nodes
form several isolated clusters. We believe that this is an intrinsic limitation of the
standard cross-entropy method.

The concept of our visualization is based on the notion that how the information
diffuses should primarily determine how the visualization is made, irrespective of the
graph topology. We observe that visualization based solely on the topology
has intrinsic limitations when we deal with a huge network, in terms of both
computational complexity (e.g., the spring model does not work for a network with
millions of nodes) and understandability. Overall, we can conclude that the proposed method
provides a better visualization, one that is more intuitive and more easily understandable.

5 Related Work and Discussion

As defined earlier, let $K$ and $N$ be the numbers of target and non-target nodes in a
network. Then the computational complexity of our embedding method amounts to
$O(NK)$, where we assume the number of learning iterations and the embedding
dimension to be constants. This reduced complexity greatly expands the applicability of
our method over other representative network embedding methods, e.g., the spring
model method [7] and the standard cross-entropy method [14], both of which require
a computational complexity of $O(N^2)$ under the setting $K \ll N$.

In view of computational complexity, our visualization method is closely related to
conventional methods such as FastMap and Landmark Multidimensional Scaling
(LMDS), which are based on the Nyström approximation [13]. Typically, these methods
randomly select a set of pivot (or landmark) objects and then produce an embedding
that preserves the relationships between all pairs of pivot and non-pivot objects.
In contrast, our method selects target (pivot) nodes based on the information diffusion
models.

Our method adopts the basic idea of probabilistic embedding algorithms, including
Parametric Embedding (PE) [6] and Neural Gas Cross-Entropy (NG-CE) [4]. The
PE method attempts to uncover classification structures by use of posterior probabilities,
while the NG-CE method is restricted to visualizing the codebooks of the neural gas
model. Our purpose, on the other hand, is to effectively visualize the information diffusion
process. The two visualization strategies we proposed match this aim.

We are not the first to try to visualize the information diffusion process. Adar and
Adamic [1] presented a visualization system that tracks the flow of URLs through blogs.
However, as with the methods above, their visualization method did not incorporate an
information diffusion model. Further, they laid out only a small number of nodes in a tree
structure, and it is unlikely that their approach scales up to a large network.

Finally, we should emphasize that, unlike most representative embedding methods
for networks [3], our visualization method is applicable to large-scale directed graphs
while incorporating the effect of information diffusion models. In this paper, however,
we also performed experiments using the undirected (bi-directional) Wikipedia network,
because we wanted to include a setting favorable to the comparison
methods. As noted earlier, the conventional embedding methods cannot be applied
directly to directed graphs without some topology modification such as link addition or
deletion.

6  Conclusion 

We proposed an innovative probabilistic visualization method to help understand complex
networks. The node embedding scheme in the method, formulated as a model-based
cross-entropy minimization problem, explicitly takes account of the information diffusion
process, and therefore the resulting visualization is more intuitive and easier to
understand than state-of-the-art approaches such as the spring model method and the standard
cross-entropy method. Our method is efficient enough to be applied to large networks.
The experiments performed on a large blog network (directed) and a large Wikipedia
network (undirected) clearly demonstrate the advantage of the proposed method. The
proposed method is confirmed to satisfy both the path continuity and path separability
conditions, which are important requirements for the visualization to be understandable.
Our future work includes extending the proposed approach to the visualization of
growing networks.

Fig. 2: Visualization of IC model for blog network

Fig. 3: Visualization of LT model for blog network

Fig. 4: Visualization of IC model for Wikipedia network

Fig. 5: Visualization of LT model for Wikipedia network

Panels include (a) Proposed method 1, (b) Proposed method 2, (c) Spring model method 1,
(d) Spring model method 2, and (h) Color-map assignment.

Abstract. We address the problem of minimizing the spread of undesirable things,
such as computer viruses and malicious rumors, by blocking a limited number of
links in a network. This optimization problem, called the contamination minimization
problem, is not only an alternative to preventing the spread of contamination by
removing nodes from a network, but also the converse of the influence maximization
problem of finding the most influential nodes in a social network for information
diffusion. We adapt the method that we developed for the independent cascade model,
known as a model for the spread of epidemic disease, to the contamination minimization
problem under the linear threshold model, known as a model for the propagation of
innovation, which is considerably different in nature. Using large real networks, we
demonstrate experimentally that the proposed method significantly outperforms
conventional link-removal methods.


1  Introduction 

Networks can mediate the diffusion of various things such as innovation and topics.
However, undesirable things can also spread through networks. For example, computer
viruses can spread through computer networks and email networks, and malicious
rumors can spread through social networks among individuals. Thus, developing effective
strategies for preventing the spread of undesirable things through a network is an
important research issue. Previous work studied strategies for reducing the spread size by
removing nodes from a network. It has been shown in particular that the strategy of
removing nodes in decreasing order of out-degree can often be effective [1, 2, 3]. Notice
here that removing nodes necessarily involves removing links. Namely, the task of
removing links is more fundamental than that of removing nodes. Therefore, preventing
the spread of contamination by blocking links in the underlying network is an
important problem.

In contrast, finding a limited number of influential nodes that are effective for the
spread of information through a social network is also an important research issue in
terms of sociology and “viral marketing” [4, 5, 6]. Widely used fundamental
probabilistic models of information diffusion through networks are the independent cascade
(IC) model and the linear threshold (LT) model [7, 6]. Researchers have recently studied
a combinatorial optimization problem called the influence maximization problem on
a network under these models [7, 8]. Here, the influence maximization problem is the
problem of extracting a set of k nodes to target for initial activation such that it yields
the largest expected spread of information, where k is a given positive integer. Note also
that the IC and LT models are fundamental models of the contamination diffusion process
on networks [6].

The problem we address in this paper is the converse of the influence
maximization problem: to minimize the spread of contamination by
blocking a limited number of links in a network. More specifically, when some
undesirable thing starts at any node and diffuses through the network, we consider
finding a set of k links such that the network obtained by blocking those links minimizes
the expected contamination area of the undesirable thing, where k is a given positive
integer. This combinatorial optimization problem is referred to as the contamination
minimization problem [9]. For the contamination minimization problem under the IC
model, Kimura, Saito and Motoda [9] presented a method for efficiently finding a good
approximate solution on the basis of a natural greedy strategy.

In this paper, we propose a method for efficiently finding a good approximate
solution to the contamination minimization problem under the LT model by adapting the
greedy method developed for the problem under the IC model. Note here that the IC and
LT models differ considerably in nature. First, the LT model is originally a model for
the propagation of innovation through a network, while the IC model can be identified
with the SIR model [10] for the spread of epidemic disease in a network. Moreover,
the LT model is viewed as a probabilistic model defined on a continuous space, while
the IC model is viewed as one defined on a finite set (i.e., a discrete space) [7, 8].
Therefore, the effectiveness of the greedy method for the problem under the LT model is not
self-evident. To compare the performance of methods for solving the problem across
various networks, we newly introduce the contamination reduction rate as a performance
measure. Using large real social networks, we experimentally demonstrate that the proposed
method significantly outperforms link-removal heuristics that rely on the well-studied
notions of betweenness and out-degree in the field of complex network theory.

2  Problem  Formulation 

In this paper, we address the problem of minimizing the spread of some undesirable
thing in a network represented by a directed graph $G = (V, E)$. Here, $V$ and $E$
($\subset V \times V$) are the sets of all the nodes and links in the network, respectively.
We assume the LT model to be a mathematical model for the diffusion process of this
undesirable thing in the network, and investigate the contamination minimization problem
on $G$. We call nodes active if they have been contaminated by the undesirable thing.

2.1  Linear  Threshold  Model 

We define the linear threshold (LT) model on graph $G$ according to [7].
In this model, for any node $v \in V$, we specify, in advance, a weight $\omega_{u,v}$ ($> 0$)
from its parent node $u$ such that $\sum_{u \in \Gamma(v)} \omega_{u,v} \le 1$, where $\Gamma(v)$
is the set of all the parent nodes of $v$, $\Gamma(v) = \{u \in V;\ (u, v) \in E\}$. The diffusion
process from a given initial set of active nodes proceeds according to the following
randomized rule. First, for any node $v \in V$, a threshold $\theta_v$ is chosen uniformly at
random from the interval $[0, 1]$. At time-step $t$, an inactive node $v$ is influenced by each
of its active parent nodes $u$ according to weight $\omega_{u,v}$. If the total weight from active
parent nodes of $v$ is at least threshold $\theta_v$, that is,
$\sum_{u \in \Gamma_t(v)} \omega_{u,v} \ge \theta_v$, then $v$ will become active at time-step $t + 1$.
Here, $\Gamma_t(v)$ stands for the set of all the parent nodes of $v$ that are active at
time-step $t$. The process terminates if no more activations are possible.
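
The randomized rule above can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions: the function names, the dictionary-based graph representation, and the three-node toy graph are hypothetical and not part of the paper, and plain Monte Carlo simulation is used here purely for illustration.

```python
import random

def simulate_lt(parents, weights, seeds, rng):
    """Run one LT diffusion and return the set of active nodes at termination.

    parents[v]      -> list of parent nodes u, one per link (u, v)
    weights[(u, v)] -> weight of link (u, v); per node v they sum to at most 1
    """
    # Each node draws its threshold once, uniformly at random from [0, 1].
    theta = {v: rng.random() for v in parents}
    active = set(seeds)
    while True:
        newly_active = set()
        for v in parents:
            if v in active:
                continue
            total = sum(weights[(u, v)] for u in parents[v] if u in active)
            # Require total > 0 so a node without active parents never fires.
            if total > 0 and total >= theta[v]:
                newly_active.add(v)
        if not newly_active:
            return active
        active |= newly_active

def monte_carlo_influence(parents, weights, source, runs, seed=0):
    """Average number of active nodes over repeated simulations."""
    rng = random.Random(seed)
    sizes = (len(simulate_lt(parents, weights, [source], rng)) for _ in range(runs))
    return sum(sizes) / runs

# Toy graph (hypothetical): a -> b, a -> c, b -> c.
parents = {"a": [], "b": ["a"], "c": ["a", "b"]}
weights = {("a", "b"): 0.5, ("a", "c"): 1 / 3, ("b", "c"): 1 / 3}
print(monte_carlo_influence(parents, weights, "a", runs=20000))
```

For this toy graph the exact expected number of active nodes starting from a is 1 + 1/2 + 1/2 = 2, so the printed estimate should be close to 2.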

Note that the threshold $\theta_v$ models the tendency of node $v$ to adopt the information
when its parent nodes do. Note also that the LT model is a probabilistic model associated
with the uniform distribution on $[0, 1]^{|V|}$. Thus, the LT model is viewed as a probabilistic
model on the continuous space $[0, 1]^{|V|}$. Here, $|A|$ stands for the number of elements of a
set $A$.

For an initial active node $v$, let $\sigma(v; G)$ denote the expected number of active nodes
at the end of the random process of the LT model on $G$. We call $\sigma(v; G)$ the influence
degree of node $v$ in graph $G$.

2.2  Contamination  Minimization  Problem 

Now,  we  give  a  mathematical  definition  of  the  contamination  minimization  problem  on 
graph  G  =  (V,  E). 

First, we define the contamination degree $c(G)$ of graph $G$ as the average of the
influence degrees of all the nodes in $G$, that is,

$$c(G) = \frac{1}{|V|} \sum_{v \in V} \sigma(v; G). \tag{1}$$

For any link $e \in E$, let $G(e)$ denote the graph $(V, E \setminus \{e\})$. We refer to $G(e)$ as the
graph constructed by blocking $e$ in $G$. Similarly, for any $D \subset E$, let $G(D)$ denote the
graph $(V, E \setminus D)$. We refer to $G(D)$ as the graph constructed by blocking $D$ in $G$. We
define the contamination minimization problem on graph $G$ as follows: Given a positive
integer $k$ with $k < |E|$, find a subset $D^*$ of $E$ with $|D^*| = k$ such that
$c(G(D^*)) \le c(G(D))$ for any $D \subset E$ with $|D| = k$.

For a large network, any straightforward method for exactly solving the contamination
minimization problem suffers from combinatorial explosion. Therefore, we consider
approximately solving the problem.
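
To give a rough sense of this combinatorial explosion, the short sketch below counts the candidate link sets that an exhaustive search would have to compare. The helper name is hypothetical; the figures of 79,920 links and k = 500 are the blog-network values used later in the experiments.

```python
from math import comb

def num_candidate_sets(num_links, k):
    """Number of k-element subsets D of E that exhaustive search must compare."""
    return comb(num_links, k)

print(num_candidate_sets(10, 3))  # 120
# Number of decimal digits of the count for the blog network with k = 500.
print(len(str(num_candidate_sets(79920, 500))))
```

Each candidate set would additionally require an estimate of its contamination degree, which is why an approximate strategy is considered instead.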

3  Proposed  Method 

We propose a method for efficiently finding a good approximate solution to the
contamination minimization problem on graph $G = (V, E)$. We adapt the method
that we developed for the IC model to the contamination minimization problem under
the LT model, which is considerably different in nature. Let $k$ be the number of links to
be blocked in this problem.

3.1 Greedy Algorithm

We approximately solve the contamination minimization problem on $G = (V, E)$ by the
following greedy algorithm:

1. Set $D_0 \leftarrow \emptyset$.

2. Set $E_0 \leftarrow E$.

3. Set $G_0 \leftarrow G$.

4. for $i = 0$ to $k - 1$ do

5. Choose a link $e_* \in E_i$ minimizing $c(G_i(e))$, $(e \in E_i)$.

6. Set $D_{i+1} \leftarrow D_i \cup \{e_*\}$.

7. Set $E_{i+1} \leftarrow E_i \setminus \{e_*\}$.

8. Set $G_{i+1} \leftarrow (V, E_{i+1})$.

9. end for

Here, $D_k$ is the set of links blocked, and represents the approximate solution obtained
by this algorithm. $G_k$ is the graph constructed by blocking $D_k$ in graph $G$, that is,
$G_k = G(D_k)$.
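
The loop in Steps 1 to 9 can be sketched in Python as follows. This is an illustrative sketch only: the contamination estimator is passed in as a callable, and the stand-in used in the example (`mean_reachable`, which averages deterministic reachability) is an assumption chosen to keep the example small; it is not the LT-model estimator of the paper.

```python
def greedy_block_links(nodes, links, k, contamination):
    """Greedy link blocking following Steps 1-9.

    contamination(nodes, links) -> estimate of c for the graph (nodes, links);
    this callable is a placeholder for whatever estimator is available.
    """
    blocked = []             # D_i
    remaining = set(links)   # E_i
    for _ in range(k):
        # Step 5: pick the link whose blocking gives the smallest c(G_i(e)).
        # Sorting only makes tie-breaking deterministic.
        best = min(sorted(remaining),
                   key=lambda e: contamination(nodes, remaining - {e}))
        blocked.append(best)     # Step 6
        remaining.remove(best)   # Step 7; G_{i+1} = (V, E_{i+1}) is implicit
    return blocked, remaining

def mean_reachable(nodes, links):
    """Stand-in for c(G): average number of nodes reachable from each node."""
    children = {}
    for u, v in links:
        children.setdefault(u, []).append(v)
    total = 0
    for s in nodes:
        seen, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for w in children.get(u, []):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        total += len(seen)
    return total / len(nodes)

# Hypothetical path graph a -> b -> c -> d.
nodes = ["a", "b", "c", "d"]
links = [("a", "b"), ("b", "c"), ("c", "d")]
blocked, remaining = greedy_block_links(nodes, links, 1, mean_reachable)
print(blocked)  # [('b', 'c')]
```

On the path graph, blocking the middle link lowers the stand-in score the most (from 2.5 to 1.5), so it is chosen first. Note that every iteration evaluates the estimator once per remaining link, which is the cost that Section 3.2 aims to reduce.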

To implement this greedy algorithm, we need a method for calculating
$\{c(G_i(e));\ e \in E_i\}$ in Step 5 of the algorithm. However, the LT model is a stochastic
process model, and how to calculate influence degrees exactly by an efficient
method is an open question [7]. Therefore, we develop a method for estimating
$\{c(G_i(e));\ e \in E_i\}$.

Kimura, Saito, and Nakano [8] presented the bond percolation method, which efficiently
estimates the influence degrees $\{\sigma(v; G);\ v \in V\}$ for any directed graph
$G = (V, E)$. Thus, we can estimate $c(G_i(e))$ for each $e \in E_i$ by straightforwardly
applying the bond percolation method. However, $|E_i|$ becomes very large for a large network
unless $i$ is very large. Therefore, we propose a method that can estimate
$\{c(G_i(e));\ e \in E_i\}$ in a more efficient manner on the basis of the bond percolation method.


3.2  Estimation  Based  on  Bond  Percolation  Method 

It is known that the LT model is equivalent to the following bond percolation process
[7]: For any $v \in V$, we pick at most one of the incoming links to $v$ by selecting link
$(u, v)$ with probability $\omega_{u,v}$ and selecting no link with probability
$1 - \sum_{u \in \Gamma(v)} \omega_{u,v}$. Then, we declare the picked links to be “occupied” and
the other links to be “unoccupied”. Note here that the equivalent bond percolation process
for the LT model is considerably different from that of the IC model.
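
One run of this bond percolation process can be sketched as below. The sketch is illustrative: the function name and graph representation are hypothetical, and the weights are set to 1 / (|parents| + 1), which is the uniform choice the paper adopts later in this section.

```python
import random

def sample_occupied_links(parents, weights, rng):
    """One run of the bond percolation process equivalent to the LT model.

    For every node v, at most one incoming link (u, v) is declared occupied:
    link (u, v) with probability weights[(u, v)], and no link with the
    leftover probability mass.
    """
    occupied = set()
    for v, us in parents.items():
        r = rng.random()
        cumulative = 0.0
        for u in us:
            cumulative += weights[(u, v)]
            if r < cumulative:
                occupied.add((u, v))
                break
    return occupied

# Toy graph (hypothetical): a -> b, a -> c, b -> c.
parents = {"a": [], "b": ["a"], "c": ["a", "b"]}
weights = {(u, v): 1.0 / (len(us) + 1) for v, us in parents.items() for u in us}
rng = random.Random(0)
print(sample_occupied_links(parents, weights, rng))
```

Each call returns one sampled set of occupied links; repeating it M times yields the M graphs used for estimation.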

In the bond percolation method [8], we efficiently estimate the influence degrees
$\{\sigma(v; G_i);\ v \in V\}$ in the following way. Let $M$ be a sufficiently large positive
integer. We perform the bond percolation process $M$ times, and sample a set of $M$ graphs,
$\{G_i^m = (V, E_i^m);\ m = 1, \dots, M\}$, constructed by the occupied links. Then, using the
strongly connected component decomposition of each $G_i^m$, we efficiently estimate the
influence degrees $\{\sigma(v; G_i);\ v \in V\}$ as

$$\sigma(v; G_i) = \frac{1}{M} \sum_{m=1}^{M} |\mathcal{F}(v; G_i^m)|, \quad (v \in V), \tag{2}$$

(see [8] for details). Here, $\mathcal{F}(v; G_i^m)$ denotes the set of all the nodes that are
reachable from node $v$ in the graph $G_i^m$. We say that node $u$ is reachable from node $v$
if there is a path from $v$ to $u$ along the links in the graph.
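
As an illustration of Equation (2), the sketch below averages reachable-set sizes over a handful of sampled graphs. It is a simplified sketch: it uses a plain breadth-first search per source node instead of the strongly connected component decomposition of [8], and the four hand-written samples and all names are hypothetical.

```python
from collections import deque

def reachable_set(source, occupied_links):
    """Nodes reachable from source along occupied links (source included)."""
    children = {}
    for u, v in occupied_links:
        children.setdefault(u, []).append(v)
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in children.get(u, []):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def estimate_influence(source, samples):
    """Equation (2): average reachable-set size over the M sampled graphs."""
    return sum(len(reachable_set(source, s)) for s in samples) / len(samples)

# Four hypothetical sampled link sets on nodes a, b, c (M = 4).
samples = [{("a", "b"), ("b", "c")}, {("a", "c")}, set(), {("a", "b")}]
print(estimate_influence("a", samples))  # 2.0
print(estimate_influence("b", samples))  # 1.25
```

From node a the reachable sets have sizes 3, 2, 1 and 2, giving 8 / 4 = 2.0; from node b they have sizes 2, 1, 1 and 1, giving 1.25.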

We are now in a position to give a method for efficiently estimating
$\{c(G_i(e));\ e \in E_i\}$ in Step 5 of the greedy algorithm. For the LT model, the weights
$\{\omega_{u,v}\}$ must be specified in advance. We uniformly set the weights as follows: For
any node $v \in V$, the weight $\omega_{u,v}$ from a parent node $u \in \Gamma(v)$ is given by

$$\omega_{u,v} = \frac{1}{|\Gamma(v)| + 1}.$$

Here note that $\sum_{u \in \Gamma(v)} \omega_{u,v} < 1$ for any $v \in V$, that is, there is a chance
that node $v$ does not become active even if all the parent nodes of $v$ are active. Then, on
the basis of Equations (1) and (2), and the independence of the bond percolation process,
we estimate $\{c(G_i(e));\ e \in E_i\}$ by

$$c(G_i(e)) = \frac{1}{|M_i(e)|\,|V|} \sum_{m \in M_i(e)} \sum_{v \in V} |\mathcal{F}(v; G_i^m)|, \quad (e \in E_i),$$

without applying the bond percolation method for every $e \in E_i$, where
$M_i(e) = \{m \in \{1, \dots, M\};\ e \notin E_i^m\}$. Namely, the proposed method can achieve a
great reduction in computational cost compared with the conventional bond percolation method.
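
The idea of reusing one set of M sampled graphs for every candidate link can be sketched as follows. This is a hedged illustration rather than the authors' implementation: reachability is computed by a simple search instead of the strongly connected component decomposition, the normalization follows Equations (1) and (2), and the samples and names are hypothetical.

```python
def reachable_set(source, occupied_links):
    """Nodes reachable from source along occupied links (source included)."""
    children = {}
    for u, v in occupied_links:
        children.setdefault(u, []).append(v)
    seen, stack = {source}, [source]
    while stack:
        u = stack.pop()
        for w in children.get(u, []):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def estimate_blocked_contamination(nodes, links, samples):
    """Estimate c(G_i(e)) for every link e from one shared set of M samples.

    samples: list of occupied-link sets E_i^m. For each e, only the samples
    that do not contain e (the index set M_i(e)) enter the average.
    """
    # Per-sample total over all nodes, computed once and reused for all e.
    totals = [sum(len(reachable_set(v, s)) for v in nodes) for s in samples]
    estimates = {}
    for e in links:
        idx = [m for m, s in enumerate(samples) if e not in s]
        if idx:  # M_i(e) can be empty when M is small; then e gets no estimate
            estimates[e] = sum(totals[m] for m in idx) / (len(idx) * len(nodes))
    return estimates

nodes = ["a", "b", "c"]
links = [("a", "b"), ("b", "c"), ("a", "c")]
samples = [{("a", "b"), ("b", "c")}, {("a", "c")}, set(), {("a", "b")}]
est = estimate_blocked_contamination(nodes, links, samples)
print(min(est, key=est.get))  # ('a', 'b')
```

In this toy case the per-sample totals are 6, 4, 3 and 4, so the estimates are 7/6 for (a, b), 11/9 for (b, c) and 13/9 for (a, c), and the greedy step would block (a, b).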


4  Experimental  Evaluation 

4.1  Experimental  Settings 

In our experiments, we employed two large real networks used in [9], the blog
and Wikipedia networks, which exhibit many of the key features of social networks.
These are bidirectional networks. The blog network had 12,047 nodes and 79,920
directed links, and the Wikipedia network had 9,481 nodes and 245,044 directed links.

For the proposed method, we need to specify the number M of times the bond
percolation process is performed. In the experiments, we used M = 10,000 according to [8].


4.2  Comparison  Methods 

We compared the proposed method with two heuristics based on the well-studied
notions of betweenness and out-degree in the field of complex network theory.

The betweenness score $b_G(e)$ of a link $e$ in a directed graph $G = (V, E)$ is defined as
follows: $b_G(e) = \sum_{u,v \in V} n_G(e; u, v) / N_G(u, v)$, where $N_G(u, v)$ denotes the number
of shortest paths from node $u$ to node $v$ in $G$, and $n_G(e; u, v)$ denotes the number of those
shortest paths that pass through $e$. Here, we set $n_G(e; u, v) / N_G(u, v) = 0$ if
$N_G(u, v) = 0$. Newman and Girvan [11] successfully extracted community structure in a
network using the following link-removal algorithm based on betweenness:

1.  Calculate  betweenness  scores  for  all  links  in  the  network. 

2.  Find  the  link  with  the  highest  score  and  remove  it  from  the  network. 

3.  Recalculate  betweenness  scores  for  all  remaining  links.

4.  Repeat  from  Step  2. 

In particular, the notion of betweenness can be interpreted in terms of signals traveling
through a network. If signals travel from source nodes to destination nodes along the
shortest paths in a network, and all nodes send signals at the same constant rate to all
others, then the betweenness score of a link is a measure of the rate at which signals
pass along the link. Thus, we naively expect that blocking the links with the highest
betweenness scores can be effective for preventing the spread of contamination in the
network. Therefore, we apply the method of Newman and Girvan [11] to the contamination
minimization problem. We refer to this method as the betweenness method.
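
A compact sketch of the betweenness method is given below. It is an illustration under assumptions: link betweenness is computed with a Brandes-style accumulation over breadth-first searches on an unweighted directed graph, ties are broken by sorted order, and the function names and toy graph are hypothetical rather than taken from [11].

```python
from collections import deque

def edge_betweenness(nodes, links):
    """b_G(e): sum over ordered pairs (u, v) of n_G(e; u, v) / N_G(u, v)."""
    children = {v: [] for v in nodes}
    for u, v in links:
        children[u].append(v)
    score = {e: 0.0 for e in links}
    for s in nodes:
        # BFS from s: distances, shortest-path counts, and predecessor lists.
        dist, n_paths, preds, order = {s: 0}, {s: 1}, {}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in children[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    n_paths[w] = 0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    n_paths[w] += n_paths[u]
                    preds[w].append(u)
        # Back-propagate path fractions, starting from the farthest nodes.
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for u in preds.get(w, []):
                share = n_paths[u] / n_paths[w] * (1.0 + delta[w])
                score[(u, w)] += share
                delta[u] += share
    return score

def betweenness_method(nodes, links, k):
    """Block k links, recalculating betweenness after each removal."""
    remaining, blocked = list(links), []
    for _ in range(k):
        score = edge_betweenness(nodes, remaining)
        best = max(sorted(remaining), key=lambda e: score[e])
        blocked.append(best)
        remaining.remove(best)
    return blocked

nodes = ["a", "b", "c", "d"]
links = [("a", "b"), ("b", "c"), ("c", "d")]
print(edge_betweenness(nodes, links))
print(betweenness_method(nodes, links, 1))  # [('b', 'c')]
```

On the path a -> b -> c -> d, the middle link lies on four shortest paths and the outer links on three each, so the middle link is removed first.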

On the other hand, previous work has shown that simply removing nodes in order of
decreasing out-degree works well for preventing the spread of contamination in most
real networks [1, 2, 3]. Here, the out-degree of a node v means the number of outgoing
links from the node v. Therefore, as a comparison method, we consider the
straightforward application of this node removal method. Namely, we employ the method of
choosing nodes in decreasing order of out-degree and simultaneously blocking all the
links attached to the chosen nodes. We refer to this method as the out-degree method.
Note that, because all the links attached to a chosen node are blocked together, the
out-degree method cannot be applied for every value of k in the contamination
minimization problem of blocking k links.
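
The out-degree method can be sketched as below. Two points are assumptions made for the sketch and are not stated in the text: "links attached to a node" is read as both its incoming and outgoing links, and out-degrees are computed once on the original graph rather than recomputed after each removal. The names and the toy graph are hypothetical.

```python
def out_degree_method(nodes, links, num_nodes):
    """Block all links attached to the num_nodes nodes of highest out-degree."""
    out_degree = {v: 0 for v in nodes}
    for u, _ in links:
        out_degree[u] += 1
    # Ties are broken by node name only to make the sketch deterministic.
    chosen = sorted(nodes, key=lambda v: (-out_degree[v], v))[:num_nodes]
    chosen_set = set(chosen)
    blocked = [e for e in links if e[0] in chosen_set or e[1] in chosen_set]
    return chosen, blocked

nodes = ["a", "b", "c", "d"]
links = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "a")]
for n in (1, 2):
    chosen, blocked = out_degree_method(nodes, links, n)
    print(n, chosen, len(blocked))
```

Here choosing one node blocks 4 links and choosing two nodes blocks 5, so values such as k = 1, 2 or 3 cannot be realized, which illustrates why the method is not applicable for every k.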


4.3  Experimental  Results 


We  evaluated  the  performance  of  the  proposed  method  and  compared  it  with  that  of  the 
betweenness  and  out-degree  methods.  Clearly,  the  performance  of  a  method  for  solving 
the  contamination  minimization  problem  can  be  evaluated  in  terms  of  the  contamination 
reduction  rate  CRR  that  is  defined  as  follows: 


$$\mathrm{CRR} = 100 \times \frac{c(G) - c(G')}{c(G)},$$


where  G'  stands  for  a  solution  graph  constructed  by  blocking  a  specified  number  of 
links  from  the  original  graph  G.  We  estimated  the  value  of  c  by  the  bond  percolation 
method  with  M  =  10,000  (see  Equations  (1)  and  (2)),  and  computed  the  value  of  CRR. 

Figures 1 and 2 show the contamination reduction rate CRR of the resulting network
as a function of the fraction of links blocked, FLB, for the blog and Wikipedia
networks, respectively. Here, the circles, triangles and diamonds indicate the results
for the proposed, betweenness and out-degree methods, respectively. In the right-hand
panels of Figures 1 and 2, the dashed line indicates the contamination reduction rate of
the network obtained by the proposed method when the number of links blocked, k, is
500. Note that k = 500 corresponds to FLB = 0.63% and FLB = 0.20% in the blog and
Wikipedia networks, respectively. We see that the proposed method outperformed the
betweenness and out-degree methods for both the blog and the Wikipedia networks.
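
The two quantities used in this comparison are simple ratios, and the stated FLB values can be checked directly from the link counts given in Section 4.1. The function names below are hypothetical, and the contamination degrees in the CRR example are made-up illustrative numbers, not experimental results.

```python
def contamination_reduction_rate(c_original, c_blocked):
    """CRR = 100 * (c(G) - c(G')) / c(G), in percent."""
    return 100.0 * (c_original - c_blocked) / c_original

def fraction_of_links_blocked(k, num_links):
    """FLB in percent: share of the original links that are blocked."""
    return 100.0 * k / num_links

print(round(fraction_of_links_blocked(500, 79920), 2))   # 0.63 (blog)
print(round(fraction_of_links_blocked(500, 245044), 2))  # 0.2 (Wikipedia)
# Illustrative only: reducing c from 10.0 to 8.0 gives a CRR of 20 percent.
print(contamination_reduction_rate(10.0, 8.0))           # 20.0
```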

These results imply that the proposed method works effectively as expected, and
significantly outperforms the conventional link-removal heuristics, that is, the betweenness
and out-degree methods. This shows that a significantly better link-blocking strategy for
reducing the spread size of contamination can be obtained by explicitly incorporating
the diffusion dynamics of contamination in a network, rather than relying solely on
structural properties of the graph.

Fig. 1: Performance comparison of the proposed method with the betweenness and out-degree
methods in the blog network.

Fig. 2: Performance comparison of the proposed method with the betweenness and out-degree
methods in the Wikipedia network.

In  the  task  of  removing  nodes  from  a  network,  the  out-degree  heuristic  has  been 
effective  since  many  links  can  be  blocked  at  the  same  time  by  removing  nodes  with 
high  out-degrees.  However,  we  find  that  in  the  task  of  blocking  a  limited  number  of 
links,  the  strategy  of  blocking  all  the  links  attached  to  nodes  with  high  out-degrees  is 
not  necessarily  effective. 

5  Conclusion 

In an attempt to minimize the spread of undesirable things, such as computer viruses
and malicious rumors, by blocking a limited number of links in a network, we have
investigated the contamination minimization problem for the LT model, which is a
fundamental diffusion model on a network. This minimization problem is not only an
alternative to preventing the spread of contamination by removing nodes from a network,
but also the converse of the influence maximization problem of finding the most
influential nodes in a social network for information diffusion. We have adapted the
method that we developed for the IC model, known as a model for the spread of epidemic
disease, to the contamination minimization problem under the LT model, known as a model
for the propagation of innovation, which is considerably different in nature. Using
large-scale blog and Wikipedia networks, we have experimentally demonstrated that the
proposed method works effectively and significantly outperforms the conventional
link-removal heuristics based on betweenness and out-degree.

Abstract. In this paper, we attempt to answer the question “What does an information
diffusion model tell about social network structure?” To this end, we propose
a new scheme for empirical study to explore the behavioral characteristics of
representative information diffusion models such as the IC (Independent Cascade)
model and the LT (Linear Threshold) model on large networks with different
community structure. To change community structure, we first construct a GR
(Generalized Random) network from an originally observed network. Here GR
networks are constructed simply by randomly rewiring links of the original network
without changing the degree of each node. Then we plot the expected number
of influenced nodes based on an information diffusion model with respect to the
degree of each information source node. Using large real networks, we empirically
found that our proposed scheme uncovered a number of new insights. Most
importantly, we show that community structure more strongly affects information
diffusion processes of the IC model than those of the LT model. Moreover, by
visualizing these networks, we give some evidence that our claims are reasonable.


1  Introduction 

We can now obtain digital traces of human social interaction on related topics
in a wide variety of on-line settings, such as blog (weblog) communications, email
exchanges and so on. Such social interaction can be naturally represented as a large-scale
social network, where nodes (vertices) correspond to people or some social entities,
and links (edges) correspond to social interaction between them. Clearly these social
networks reflect complex social structures and distributed social trends. Thus, it seems
worth putting some effort into finding empirical regularities and developing
explanatory accounts of basic functions in the social networks. Such attempts would be
valuable for understanding social structures and trends, and for leading us to the
discovery of new knowledge and insights underlying social interaction.

A social network can also play an important role as a medium for the spread of
various information [7]. For example, innovation, hot topics and even malicious rumors
can propagate through social networks among individuals, and computer viruses can
diffuse through email networks. Previous work addressed the problem of tracking the
propagation patterns of topics through network spaces [3, 1], and studied effective
“vaccination” strategies for preventing the spread of computer viruses through networks
[8, 2]. Widely used fundamental probabilistic models of information diffusion through
networks are the independent cascade (IC) model and the linear threshold (LT) model
[4, 3]. Researchers have recently investigated the problem of finding a limited number
of influential nodes that are effective for the spread of information through a network
under these models [4, 5]. Moreover, the influence maximization problem has recently
been extended to general influence control problems such as a contamination
minimization problem [6].

To deepen our understanding of social networks and accelerate the study of information
diffusion models, we attempt to answer the question “What does an information
diffusion model tell about social network structure?” We expect that such attempts will
lead to improved methods for solving a number of problems based on information
diffusion models, such as the influence maximization problem [5]. In this paper, we
propose a new scheme for empirical study to explore the behavioral characteristics of
representative information diffusion models such as the IC model and the LT model on
large networks with different community structure. We perform extensive numerical
experiments on two large real networks, one generated from a large connected trackback
network of blog data, resulting in a directed graph of 12,047 nodes and 79,920 links,
and the other, a network of people, generated from a list of people within Japanese
Wikipedia, resulting in an undirected graph of 9,481 nodes and 245,044 links. Through
these experiments, we show that our proposed scheme can uncover a number of new
insights on information diffusion processes of the IC model and the LT model.


2  Information  Diffusion  Models 

We mathematically model the spread of information through a directed network $G = (V, E)$
under the IC or LT model, where $V$ and $E$ ($\subset V \times V$) stand for the sets of all the
nodes and links, respectively. We call nodes active if they have been influenced by
the information. In these models, the diffusion process unfolds in discrete time-steps
$t \ge 0$, and it is assumed that nodes can switch their states only from inactive to active,
but not from active to inactive. Given an initial set $S$ of active nodes, we assume that the
nodes in $S$ have first become active at time-step 0, and all the other nodes are inactive
at time-step 0.

2.1  Independent  Cascade  Model 

We define the IC model. In this model, for each directed link $(u, v)$, we specify a real
value $\beta_{u,v}$ with $0 < \beta_{u,v} < 1$ in advance. Here $\beta_{u,v}$ is referred to as the propagation
probability through link $(u, v)$. The diffusion process proceeds from a given initial active
set $S$ in the following way. When a node $u$ first becomes active at time-step $t$, it is
given a single chance to activate each currently inactive child node $v$, and succeeds with
probability $\beta_{u,v}$. If $u$ succeeds, then $v$ will become active at time-step $t + 1$. If multiple
parent nodes of $v$ first become active at time-step $t$, then their activation attempts are
sequenced in an arbitrary order, but all performed at time-step $t$. Whether or not $u$
succeeds, it cannot make any further attempts to activate $v$ in subsequent rounds. The
process terminates if no more activations are possible.
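
The IC process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the network is stored as a dictionary `children` mapping each node to its child nodes, a single uniform probability `beta` is used for every link, and the name `simulate_ic` is hypothetical rather than taken from the paper.

```python
import random


def simulate_ic(children, beta, seeds, rng):
    """Run one IC diffusion and return the set of active nodes at termination.

    children: dict mapping node -> iterable of child nodes (assumed layout)
    beta:     uniform propagation probability (assumption: same for all links)
    seeds:    initial active set S
    rng:      a random.Random instance
    """
    active = set(seeds)
    frontier = list(seeds)  # nodes that first became active at the current step
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in children.get(u, ()):
                # u gets exactly one chance to activate each inactive child v
                if v not in active and rng.random() < beta:
                    active.add(v)
                    next_frontier.append(v)  # v is active from step t + 1
        frontier = next_frontier
    return active


if __name__ == "__main__":
    toy = {0: [1, 2], 1: [3], 2: [3], 3: []}
    print(len(simulate_ic(toy, 0.5, [0], random.Random(0))))
```

Averaging the size of the returned set over many runs gives a Monte Carlo estimate of the influence degree of the seed set.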

For an initial active set $S$, let $\varphi(S)$ denote the number of active nodes at the end of
the random process for the IC model. Note that $\varphi(S)$ is a random variable. Let $\sigma(S)$
denote the expected value of $\varphi(S)$. We call $\sigma(S)$ the influence degree of $S$.

2.2  Linear  Threshold  Model 

We define the LT model. In this model, for every node $v \in V$, we specify, in advance, a
weight $\omega_{u,v}$ ($> 0$) from its parent node $u$ such that

$$\sum_{u \in \Gamma(v)} \omega_{u,v} \le 1,$$

where $\Gamma(v) = \{u \in V;\ (u, v) \in E\}$. The diffusion process from a given initial active set
$S$ proceeds according to the following randomized rule. First, for any node $v \in V$, a
threshold $\theta_v$ is chosen uniformly at random from the interval $[0, 1]$. At time-step $t$, an
inactive node $v$ is influenced by each of its active parent nodes, $u$, according to weight
$\omega_{u,v}$. If the total weight from active parent nodes of $v$ is at least threshold $\theta_v$, that is,

$$\sum_{u \in \Gamma_t(v)} \omega_{u,v} \ge \theta_v,$$

then $v$ will become active at time-step $t + 1$. Here, $\Gamma_t(v)$ stands for the set of all the parent
nodes of $v$ that are active at time-step $t$. The process terminates if no more activations
are possible.
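
A comparable sketch for the LT process follows. The data layout (a dictionary `parents` listing the parent nodes of every node, and a dictionary `weight` keyed by link) and the name `simulate_lt` are assumptions made for this example only.

```python
import random


def simulate_lt(parents, weight, seeds, rng):
    """Run one LT diffusion and return the set of active nodes at termination.

    parents: dict mapping every node -> list of its parent nodes (assumed layout)
    weight:  dict mapping link (u, v) -> weight, with per-node sums at most 1
    seeds:   initial active set S
    rng:     a random.Random instance
    """
    # Thresholds are drawn once, before the process starts.
    theta = {v: rng.random() for v in parents}
    active = set(seeds)
    while True:
        newly_active = []
        for v in parents:
            if v in active:
                continue
            active_parents = [u for u in parents[v] if u in active]
            # Only nodes with at least one active parent can be influenced.
            if active_parents and sum(weight[(u, v)] for u in active_parents) >= theta[v]:
                newly_active.append(v)
        if not newly_active:
            return active
        active.update(newly_active)  # these nodes are active from step t + 1


if __name__ == "__main__":
    chain = {0: [], 1: [0], 2: [1]}
    w = {(0, 1): 0.5, (1, 2): 0.5}
    print(sorted(simulate_lt(chain, w, [0], random.Random(0))))
```

Because the thresholds are fixed at the start, all randomness of one run is contained in `theta`; the update loop itself is deterministic.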

The LT model is also a probabilistic model associated with the uniform distribution
on $[0, 1]^{|V|}$. Similarly to the IC model, we define a random variable $\varphi(S)$ and its expected
value $\sigma(S)$ for the LT model.

2.3  Bond  Percolation  Method 

First, we revisit the bond percolation method [5]. Here, we consider estimating the
influence degrees $\{\sigma(v; G);\ v \in V\}$ for the IC model with propagation probability $p$ in
graph $G = (V, E)$. For simplicity, we assign a uniform value $p$ to every $\beta_{u,v}$.

It  is  known  that  the  IC  model  is  equivalent  to  the  bond  percolation  process  that 
independently  declares  every  link  of  G  to  be  “occupied”  with  probability  p  [7]. 

It is known that the LT model is equivalent to the following bond percolation process
[4]: For any $v \in V$, we pick at most one of the incoming links to $v$ by selecting link
$(u, v)$ with probability $\omega_{u,v}$ and selecting no link with probability $1 - \sum_{u \in \Gamma(v)} \omega_{u,v}$. Then,
we declare the picked links to be “occupied” and the other links to be “unoccupied”.
Note here that the equivalent bond percolation process for the LT model is considerably
different from that of the IC model.

Let $M$ be a sufficiently large positive integer. We perform the bond percolation
process $M$ times, and sample a set of $M$ graphs constructed by the occupied links,

$$\{G_m = (V, E_m);\ m = 1, \dots, M\}.$$

Then, we can approximate the influence degree $\sigma(v; G)$ by

$$\sigma(v; G) \simeq \frac{1}{M} \sum_{m=1}^{M} |\mathcal{F}(v; G_m)|.$$

Here, for any directed graph $G = (V, E)$, $\mathcal{F}(v; G)$ denotes the set of all the nodes that
are reachable from node $v$ in the graph. We say that node $u$ is reachable from node $v$ if
there is a path from $v$ to $u$ along the links in the graph. Let

$$V = \bigcup_{u \in \mathcal{U}(G_m)} S(u; G_m)$$

be the strongly connected component (SCC) decomposition of graph $G_m$, where $S(u; G_m)$
denotes the SCC of $G_m$ that contains node $u$, and $\mathcal{U}(G_m)$ stands for a set of all the
representative nodes for the SCCs of $G_m$. Since all nodes in the same SCC reach exactly
the same set of nodes, the bond percolation method performs the SCC
decomposition of each $G_m$, and estimates all the influence degrees $\{\sigma(v; G);\ v \in V\}$ in
$G$ as follows:

$$\sigma(v; G) = \frac{1}{M} \sum_{m=1}^{M} |\mathcal{F}(u; G_m)|, \quad (v \in S(u; G_m)), \tag{1}$$

where $u \in \mathcal{U}(G_m)$.
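
As a rough illustration of the estimate for the IC model, the sketch below samples `M` occupied-link graphs and averages the sizes of the reachable sets. It is a simplification under stated assumptions: it computes reachability separately for every node and omits the SCC decomposition that the method above uses to share work between nodes, and the names `reachable` and `estimate_influence_ic` are hypothetical.

```python
import random


def reachable(adj, v):
    """Return the set of nodes reachable from v (including v) by depth-first search."""
    seen = {v}
    stack = [v]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def estimate_influence_ic(nodes, links, p, M, rng):
    """Estimate the influence degree of every node under the IC model.

    Each of the M rounds keeps every link independently with probability p
    (bond percolation) and records how many nodes each node can reach.
    """
    totals = {v: 0 for v in nodes}
    for _ in range(M):
        adj = {v: [] for v in nodes}
        for u, v in links:
            if rng.random() < p:  # link (u, v) is declared "occupied"
                adj[u].append(v)
        for v in nodes:
            totals[v] += len(reachable(adj, v))
    return {v: totals[v] / M for v in nodes}


if __name__ == "__main__":
    est = estimate_influence_ic([0, 1, 2], [(0, 1), (1, 2)], 0.5, 1000, random.Random(0))
    print(est)
```

The per-node search makes each round cost roughly the number of nodes times the number of occupied links; grouping nodes by SCC, as in equation (1), avoids repeating the search for nodes that share the same reachable set.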

3  Proposed  Scheme  for  Experimental  Study 

We describe our proposed scheme for empirical study to explore the behavioral
characteristics of representative information diffusion models on large networks
with different community structure. In addition, we present a method for visualizing such
networks in terms of community structure. Hereafter, the degree of a node $v$, denoted
by $\deg(v)$, means the number of links connecting from or to the node $v$.

3.1 Effect of Community Structure

As mentioned earlier, our scheme consists of two parts. Namely, to change the community
structure, we first construct a GR (generalized random) network from an originally
observed network. Here GR networks are constructed just by randomly rewiring links
of the original network without changing the degree of each node [7]. Then we plot the
influence degree based on an information diffusion model with respect to the degree of
each information source node.

First we describe the method for constructing a GR network. By arbitrarily ordering
all links in a given original network, we can prepare a link list $L_E = (e_1, \dots, e_{|E|})$.
Recall that each directed link consists of an ordered pair of from-part and to-part nodes,
i.e., $e = (u, v)$. Thus, we can produce two node lists from the list $L_E$, that is, the
from-part node list $L_F$ and the to-part node list $L_T$. Clearly the frequency of each node $v$
appearing in $L_F$ (or $L_T$) is equivalent to the out (or in) degree of the node $v$. Therefore,
by randomly reordering the node list $L_T$, then pairing it element by element with the other
node list $L_F$, we can produce a link list for a GR network. More specifically, let $L'_T$ be a
shuffled node list, and denote the $l$-th element of a list $L$ by $L(l)$; then the link list of
the GR network is $L'_E = ((L_F(1), L'_T(1)), \dots, (L_F(|E|), L'_T(|E|)))$. Here note that, to
compare the GR network fairly with the original one in terms of influence degree, we excluded
shuffled node lists that produce a GR network with self-links
of some node or multiple links between any two nodes.
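
The construction can be sketched as follows. This is a minimal illustration, assuming links are given as a list of `(from, to)` pairs; the function name `make_gr_network` and the `max_tries` parameter are hypothetical. It rejects a whole shuffle whenever a self-link or a multiple link appears, which mirrors the exclusion rule above but may need many attempts on large, dense networks.

```python
import random


def make_gr_network(links, rng, max_tries=10000):
    """Return a rewired link list with the same in- and out-degree for every node.

    links: list of directed links (u, v) without self-links or duplicates
    rng:   a random.Random instance
    """
    from_list = [u for u, _ in links]  # from-part node list
    to_list = [v for _, v in links]    # to-part node list
    for _ in range(max_tries):
        shuffled = to_list[:]
        rng.shuffle(shuffled)
        candidate = list(zip(from_list, shuffled))
        has_self_link = any(u == v for u, v in candidate)
        has_multi_link = len(set(candidate)) < len(candidate)
        if not has_self_link and not has_multi_link:
            return candidate
    raise RuntimeError("no valid shuffle found within max_tries")


if __name__ == "__main__":
    original = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (1, 3)]
    print(make_gr_network(original, random.Random(0)))
```

Because only the to-part list is permuted, each node appears in the from-part and to-part lists exactly as often as before, so both degree sequences are preserved by construction.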

By using the bond percolation method described in the previous section, we can
efficiently obtain the influence degree $\sigma(v)$ for each node $v$. Thus we can
straightforwardly plot each pair of $\deg(v)$ and $\sigma(v)$. Moreover, to examine the tendency
of nodes with the same degree $\delta$, we also plot the average influence degree $\mu(\delta)$ calculated by

$$\mu(\delta) = \frac{1}{|\{v : \deg(v) = \delta\}|} \sum_{\{v : \deg(v) = \delta\}} \sigma(v). \tag{2}$$

Clearly we can guess that nodes with larger degrees influence many other nodes in any
information diffusion model, but we consider that it is worth examining these curves in
more detail.
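
Equation (2) is a plain group-by average. A small sketch, assuming the degrees and influence degrees are held in dictionaries keyed by node (the name `average_influence_by_degree` is illustrative):

```python
from collections import defaultdict


def average_influence_by_degree(deg, sigma):
    """Return a dict mapping each degree value to the mean influence degree
    of the nodes having that degree, as in equation (2)."""
    groups = defaultdict(list)
    for v, d in deg.items():
        groups[d].append(sigma[v])
    return {d: sum(values) / len(values) for d, values in groups.items()}


if __name__ == "__main__":
    deg = {"a": 1, "b": 1, "c": 2}
    sigma = {"a": 2.0, "b": 4.0, "c": 5.0}
    print(average_influence_by_degree(deg, sigma))
```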


3.2  Visualization  of  Community  Structure 

In order to intuitively grasp the original and GR networks in terms of community
structure, we present a visualization method that is based on the cross-entropy algorithm
[11] for network embedding, and the k-core notion [10] for label assignment.

First we describe the network embedding problem. Let $\{x_v : v \in V\}$ be the embedding
positions of the corresponding $|V|$ nodes in an $R$-dimensional Euclidean space. As
usual, we define the Euclidean distance between $x_u$ and $x_w$, in squared form, as follows:

$$d_{u,w} = \|x_u - x_w\|^2 = \sum_{r=1}^{R} (x_{u,r} - x_{w,r})^2.$$

Here we introduce a monotonic decreasing function $\rho(s) \in [0, 1]$ with respect to $s \ge 0$,
where $\rho(0) = 1$ and $\rho(\infty) = 0$. Let $a_{u,w} \in \{0, 1\}$ be the adjacency information between
two nodes $u$ and $w$, indicating whether there exists a link between them ($a_{u,w} = 1$) or
not ($a_{u,w} = 0$). Then we can introduce a cross-entropy (cost) function between $a_{u,w}$ and
$\rho(d_{u,w})$ as follows:

$$\mathcal{E}_{u,w} = -a_{u,w} \ln \rho(d_{u,w}) - (1 - a_{u,w}) \ln(1 - \rho(d_{u,w})).$$

Since $\mathcal{E}_{u,w}$ is minimized when $\rho(d_{u,w}) = a_{u,w}$, this minimization with respect to $x_u$ and $x_w$
basically coincides with our problem setting. In this paper, we employ $\rho(s) = \exp(-s/2)$
as the monotonic decreasing function. Then the total cost function (objective function)
can be defined as follows:

$$\mathcal{E} = \frac{1}{2} \sum_{u \in V} \sum_{w \in V} a_{u,w} d_{u,w} - \sum_{u \in V} \sum_{w \in V} (1 - a_{u,w}) \ln(1 - \rho(d_{u,w})). \tag{3}$$

Namely, the cross-entropy algorithm minimizes the objective function defined in (3)
with respect to $\{x_v : v \in V\}$.
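
To make the objective concrete, the sketch below evaluates the total cross-entropy cost for given positions with the decreasing function exp(-s/2). It rests on assumptions not stated in the text: the distance is taken in squared form, pairs with `u == w` are skipped (otherwise the logarithm would diverge), and adjacency is treated as symmetric. The name `cross_entropy_cost` is hypothetical, and the code only evaluates the cost; it does not perform the minimization.

```python
import math


def cross_entropy_cost(positions, links):
    """Evaluate the embedding objective for fixed node positions.

    positions: dict mapping node -> coordinate tuple
    links:     iterable of (u, w) pairs; adjacency is treated as symmetric
    """
    adjacent = set()
    for u, w in links:
        adjacent.add((u, w))
        adjacent.add((w, u))
    total = 0.0
    for u in positions:
        for w in positions:
            if u == w:
                continue  # assumption: self-pairs are excluded from the sum
            d = sum((a - b) ** 2 for a, b in zip(positions[u], positions[w]))
            if (u, w) in adjacent:
                total += d / 2.0  # -ln(exp(-d/2)) = d/2
            else:
                # requires d > 0: non-adjacent nodes must not coincide
                total += -math.log(1.0 - math.exp(-d / 2.0))
    return total


if __name__ == "__main__":
    pos = {0: (0.0, 0.0), 1: (2.0, 0.0)}
    print(cross_entropy_cost(pos, [(0, 1)]))
```

Linked pairs are penalized for being far apart and unlinked pairs for being close, which is why minimizing the cost pulls communities together in the embedding.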

Next we explain the k-core notion. For a given node $v$ in the network $G = (V_G, E_G)$,
we denote by $A_G(v)$ the set of adjacent nodes of $v$ as follows:

$$A_G(v) = \{w : (v, w) \in E_G\} \cup \{u : (u, v) \in E_G\}.$$

A subnetwork $C(k)$ of $G$ is called a k-core if each node in $C(k)$ has at least $k$
adjacent nodes in $C(k)$. More specifically, we can define the k-core subnetwork as follows.
For a given order $k$, the k-core is a subnetwork $C(k) = (V_{C(k)}, E_{C(k)})$ consisting of the
following node set $V_{C(k)} \subset V_G$ and link set $E_{C(k)} \subset E_G$:

$$V_{C(k)} = \{v : |A_{C(k)}(v)| \ge k\}, \quad E_{C(k)} = \{e : e \subset V_{C(k)}\}.$$

Here, according to our purpose, we focus on the subnetwork of maximum size with this
property as the k-core subnetwork $C(k)$.
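
The maximum subnetwork with this property can be found by repeatedly deleting nodes that have fewer than k adjacent nodes. The sketch below uses this standard peeling procedure, which is an assumption on our part since the text does not say how the k-core is computed; `k_core` is an illustrative name.

```python
def k_core(nodes, links, k):
    """Return the node set of the maximum k-core.

    Adjacency ignores link direction and self-links, matching the
    definition of adjacent nodes as a union of in- and out-neighbours.
    """
    adj = {v: set() for v in nodes}
    for u, v in links:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    alive = set(nodes)
    queue = [v for v in alive if len(adj[v]) < k]
    while queue:
        v = queue.pop()
        if v not in alive:
            continue
        alive.discard(v)
        for w in adj[v]:
            # w loses one adjacent node; it may now fall below k
            adj[w].discard(v)
            if w in alive and len(adj[w]) < k:
                queue.append(w)
    return alive


if __name__ == "__main__":
    # a triangle 0-1-2 with a pendant node 3 attached to node 0
    print(sorted(k_core([0, 1, 2, 3], [(0, 1), (1, 2), (2, 0), (0, 3)], 2)))
```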

Finally we describe the label assignment strategy. As a rough necessary condition,
we assume that each community over a network includes a higher order k-core as its
part. Here we consider that a candidate for such a higher core order is one greater than the
average degree calculated by $d = |E|/|V|$. Then we can summarize our visualization
method as follows: after embedding a given network into an $R$-dimensional (typically
$R = 2$) Euclidean space by use of the cross-entropy algorithm, our visualization method
plots each node position by changing the appearance of nodes belonging to its
$([d] + 1)$-core subnetwork. Here note that $[d]$ denotes the greatest integer not exceeding $d$.
By this visualization method, we can expect to roughly grasp the community structure of a
given network.

4  Experimental  Evaluation 

4.1  Network  Data 

In  our  experiments,  we  employed  two  sets  of  real  networks  used  in  [5],  which  exhibit 
many  of  the  key  features  of  social  networks  as  shown  later.  We  describe  the  details  of 
these  network  data. 

The first one is a trackback network of blogs. Blogs are personal on-line diaries
managed by easy-to-use software packages, and have rapidly spread through the World
Wide Web [3]. Bloggers (i.e., blog authors) discuss various topics by using trackbacks.
Thus, a piece of information can propagate from one blogger to another blogger through
a trackback. We exploited the blog “Theme salon of blogs” in the site “goo”², where a
blogger can recruit trackbacks of other bloggers by registering an interesting theme.
By tracing up to ten steps back in the trackbacks from the blog of the theme “JR
Fukuchiyama Line Derailment Collision”, we collected a large connected trackback
network in May, 2005. The resulting network had 12,047 nodes and 79,920 directed
links, and features the so-called “power-law” distributions for the out-degree and
in-degree that most large real networks exhibit. We refer to this network data as the blog
network.

2 http://blog.goo.ne.jp/usertheme/

The second one is a network of people that was derived from the “list of people”
within Japanese Wikipedia. Specifically, we extracted the maximal connected
component of the undirected graph obtained by linking two people in the “list of people” if
they co-occur in six or more Wikipedia pages. The undirected graph is represented by
an equivalent directed graph by regarding undirected links as bidirectional ones³. The
resulting network had 9,481 nodes and 245,044 directed links. We refer to this network
data as the Wikipedia network.


4.2  Characteristics  of  Network  Data 


Newman and Park [9] observed that social networks represented as undirected graphs
generally have the following two statistical properties that are different from non-social
networks. First, they show positive correlations between the degrees of adjacent nodes.
Second, they have much higher values of the clustering coefficient $C$ than the
corresponding configuration model defined as the ensemble of GR networks. Here, the
clustering coefficient $C$ for an undirected network is defined by

$$C = \frac{1}{|V|} \sum_{u \in V} \frac{|\{(v, w) : v, w \in A_G(u),\ v \ne w,\ w \in A_G(v)\}|}{|A_G(u)|(|A_G(u)| - 1)}.$$

Another widely used statistical measure of a network is the average length of shortest
paths between any two nodes defined by

$$L = \frac{1}{|V|(|V| - 1)} \sum_{u \ne v} l(u, v),$$

where $l(u, v)$ denotes the shortest path length between nodes $u$ and $v$. In terms of
information diffusion processes, when $L$ becomes smaller, the probability that an
information source node can activate the other nodes generally becomes larger.
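
Both statistics can be computed directly from an adjacency structure. The sketch below assumes an undirected, connected network stored as a dictionary `adj` of neighbour sets, and assumes that nodes with fewer than two neighbours contribute zero to the clustering coefficient (the text does not say how such nodes are handled). The function names are illustrative.

```python
from collections import deque


def clustering_coefficient(adj):
    """Average, over all nodes u, of the fraction of ordered neighbour pairs
    (v, w) of u that are themselves adjacent."""
    total = 0.0
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # assumption: such nodes contribute 0
        linked = sum(1 for v in nbrs for w in nbrs if v != w and w in adj[v])
        total += linked / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest path length over all ordered pairs of distinct nodes,
    using breadth-first search from every node (assumes a connected network)."""
    n = len(adj)
    total = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total / (n * (n - 1))


if __name__ == "__main__":
    triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
    print(clustering_coefficient(triangle), average_path_length(triangle))
```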

Table 1 shows the basic statistics of the blog and Wikipedia networks, together with
their GR networks. We can see that the measured value of $C$ for the original blog
network is substantially larger than that of the GR blog network, and the measured value
of $L$ for the original blog network is also larger than that of the GR blog network,
indicating that communities exist. We can observe a similar tendency for the
Wikipedia networks. Note that we have already confirmed for the original Wikipedia
network that the degrees of adjacent nodes were positively correlated, although we
derived the network from Japanese Wikipedia. Therefore, we can say that the Wikipedia
network has the key features of social networks.


4.3  Experimental  Settings 

We describe our experimental settings of the IC and LT models. In the IC model, we
assigned a uniform probability $\beta$ to the propagation probability $\beta_{u,v}$ for any directed
link $(u, v)$ of a network, that is, $\beta_{u,v} = \beta$. As our $\beta$ setting, we employed the reciprocal
of the average degree, i.e., $\beta = |V|/|E|$. The resulting propagation probability was
$\beta = 0.1507$ for the original and GR blog networks, and $\beta = 0.0387$ for the original and
GR Wikipedia networks. Incidentally, these values are reasonably close to those used
in a former study, where $\beta = 0.2$ for the blog networks and $\beta = 0.03$ for the Wikipedia
networks were used in the experiments [6].

Table 1: Basic statistics of networks.

network              |V|      |E|       C         L
original blog        12,047   79,920    0.26197   8.17456
GR blog              12,047   79,920    0.00523   4.24140
original Wikipedia   9,481    245,044   0.55182   4.69761
GR Wikipedia         9,481    245,044   0.04061   3.12848

3 For simplicity, we call a graph with bi-directional links an undirected graph.
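
The two probability values can be checked from the node and link counts reported for the networks. The helper name below is illustrative only.

```python
def reciprocal_average_degree(num_nodes, num_links):
    """Return |V| / |E|, the reciprocal of the average degree |E| / |V|."""
    return num_nodes / num_links


if __name__ == "__main__":
    # node and link counts as reported for the blog and Wikipedia networks
    print(round(reciprocal_average_degree(12047, 79920), 4))
    print(round(reciprocal_average_degree(9481, 245044), 4))
```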

In the LT model, we uniformly set weights as follows. For any node $v$ of a
network, the weight $\omega_{u,v}$ from a parent node $u \in \Gamma(v)$ is given by $\omega_{u,v} = 1/|\Gamma(v)|$. This
experimental setting is exactly the same as the one used in [5].

For the proposed method, we need to specify the number $M$ of times the bond
percolation process is performed. In the experiments, we used $M = 10{,}000$ [5]. Recall that the
parameter $M$ represents the number of bond percolation processes for estimating the
influence degree $\sigma(v)$ of a given initial active node $v$. In our preliminary experiments, we
have already confirmed that the influence degrees of each node for these networks with
$M = 10{,}000$ are comparable to those with $M = 300{,}000$.

4.4  Experimental  Results  Using  Blog  Network 

Figure 1a shows the influence degree based on the IC model with respect to the degree
of each information source node over the original blog network. Figure 1b shows those
of the IC model over the GR blog network. Figure 1c shows those of the LT model
over the original blog network, and Figure 1d shows those of the LT model over
the GR blog network. Here the red dots and blue circles respectively stand for the
levels of the influence degree of individual nodes and their averages for the nodes with
the same degree.

In view of the difference between the information diffusion models, we can clearly
see that although nodes with larger degrees influenced many other nodes in both the
IC and LT models, their average curves exhibit opposite curvatures as shown in these
results. In addition, we can observe that the influence degrees of the individual nodes
based on the IC model have quite large variances compared with those of the LT model.

In view of the difference between the original and GR networks, we can see that
compared with the original networks, the levels of the influence degree were somewhat
larger in the GR networks. We consider that this is because the average shortest path
lengths of the original networks were substantially larger than those of the GR networks,
and the effect is especially visible for the IC model. In the case of the LT model over the
GR network (Figure 1d), we can observe
that the influence degree was almost uniquely determined by the degree of each node.
As the most remarkable point, in the case of the IC model, we can observe a number
of lateral lines composed of the individual influence degrees over the original networks
(Figure 1a), but these lines disappeared over the GR networks (Figure 1b).


[Figure 1 panels: (a) Original network by IC model; (b) GR network by IC model;
(c) Original network by LT model; (d) GR network by LT model]

Fig. 1: Comparison of information diffusion processes using blog network


4.5  Experimental  Results  Using  Wikipedia  Network 

Figure 2 shows the same experimental results using the Wikipedia networks. From these
results, we can derive arguments similar to those of the blog networks. Thus we consider
that our arguments were substantially strengthened by these experiments.

We summarize the main points below. 1) Nodes with larger degrees influenced many
other nodes, but their average curves of the IC and LT models exhibited opposite
curvatures; 2) The levels of the influence degree over the GR networks were somewhat larger
than those of the original networks in both the IC and LT models; 3) The influence
degree was almost uniquely determined by the degree of each node in the case of the LT
model using the GR network (Figure 2d); and 4) A number of lateral lines composed
of the individual influence degrees appeared in the case of the IC model using the original
network (Figure 2a).


[Figure 2 panels: (a) Original network by IC model; (b) GR network by IC model;
(c) Original network by LT model; (d) GR network by LT model]

Fig. 2: Comparison of information diffusion processes using Wikipedia network


4.6  Community  Structure  Analysis 

Figure 3 shows our visualization results. Here, in the case of the blog networks, since
the average degree was $d = 6.6340$, we represented the nodes belonging to the
7-core subnetwork by red points, and others by blue points. Similarly, in the case of the
Wikipedia networks, since the average degree was $d = 25.8458$, we represented the
nodes belonging to the 26-core subnetwork by red points, and others by blue points.
These visualization results show that the nodes of higher core order are scattered here
and there in the original networks (Figures 3a and 3c), while those nodes are
concentrated near the center in the GR networks (Figures 3b and 3d). This clearly indicates that
the transformation to GR networks changes the community structure from a distributed
one to a lumped one.

Since the main difference between the original and GR networks is their community
structure, we consider that the lateral lines that appeared in the original
networks using the IC model (Figures 1a and 2a) are closely related to the distributed
community structure of social networks. On the other hand, we cannot observe such
remarkable characteristics for the LT model (Figures 1c and 2c). In consequence, we
can say that community structure more strongly affects information diffusion processes
of the IC model than those of the LT model.


[Figure 3 panels: (a) Original blog network; (b) GR blog network;
(c) Original Wikipedia network; (d) GR Wikipedia network]

Fig. 3: Visualization of Networks


5  Conclusion 

In this paper, we proposed a new scheme for empirical study to explore the behavioral
characteristics of representative information diffusion models such as the Independent
Cascade model and the Linear Threshold model on large networks with different
community structure. The proposed scheme consists of two parts, i.e., GR (generalized
random) network construction from an originally observed network, and plotting of the
influence degree of each node based on an information diffusion model. Using large real
networks, we empirically found that our proposed scheme uncovers a number of new
insights. Most importantly, we showed that community structure more strongly affects
information diffusion processes of the IC model than those of the LT model. Our future
work includes the analysis of relationships between community structure and
information diffusion models by using a wide variety of social networks. We are also planning to
perform further experiments by elaborating probability settings for information diffusion
models.

We address the problem of minimizing the propagation of undesirable things, such as computer
viruses or malicious rumors, by blocking a limited number of links in a network, which is converse
to the influence maximization problem in which the most influential nodes for information
diffusion are searched for in a social network. This minimization problem is more fundamental than the
problem of preventing the spread of contamination by removing nodes in a network. We introduce
two definitions for the contamination degree of a network, accordingly define two contamination
minimization problems, and propose methods for efficiently finding good approximate solutions to
these problems on the basis of a naturally greedy strategy. Using large social networks, we
experimentally demonstrate that the proposed methods outperform conventional link-removal methods.
We also show that unlike the case of blocking a limited number of nodes, the strategy of removing
nodes with high out-degrees is not necessarily effective for these problems.

Categories and Subject Descriptors: G.2.2 [Discrete Mathematics]: Graph Theory - network
problems; H.2.8 [Database Management]: Database Applications - data mining; J.4
[Computer Applications]: Social and Behavioral Sciences - sociology

General  Terms:  Algorithms 

Additional  Key  Words  and  Phrases:  Contamination  diffusion,  link  analysis,  social  networks 


1.  INTRODUCTION 

Considerable  attention  has  recently  been  devoted  to  investigating  the  structure  and 
function  of  various  networks  including  computer  networks,  social  networks  and  the 
World  Wide  Web  [Newman  2003].  From  a  functional  point  of  view,  networks  can 
mediate diffusion of various things such as innovation and topics. However,
undesirable things can also spread through networks. For example, computer viruses
can  spread  through  computer  networks  and  email  networks,  and  malicious  rumors 
can  spread  through  social  networks  among  individuals.  Thus,  developing  effective 
strategies  for  preventing  the  spread  of  undesirable  things  through  a  network  is  an 
important  research  issue.  Previous  work  studied  strategies  for  reducing  the  spread 
size  by  removing  nodes  from  a  network.  It  has  been  shown  in  particular  that  the 
strategy  of  removing  nodes  in  decreasing  order  of  out-degree  can  often  be  effective 
[Albert  et  al.  2000;  Broder  et  al.  2000;  Callaway  et  al.  2000;  Newman  et  al.  2002]. 
Here  notice  that  removal  of  nodes  by  necessity  involves  removal  of  links.  Namely, 
the  task  of  removing  links  is  more  fundamental  than  that  of  removing  nodes,  and 
this  is  the  problem  we  address  in  the  paper. 

In contrast, finding influential nodes that are effective for the spread of
information through a social network is also an important research issue in terms of
sociology and “viral marketing” [Domingos and Richardson 2001; Richardson and
Domingos 2002; Gruhl et al. 2004]. Recent studies include attempts to solve a
combinatorial optimization problem called the influence maximization problem on
a network under the independent cascade (IC) model, a widely-used fundamental
probabilistic model of information diffusion [Kempe et al. 2003; Kimura et al. 2007].
Here, the influence maximization problem is the problem of extracting a set of K
nodes to target for initial activation such that it yields the largest expected spread
of information, where K is a given positive integer. Note also that the IC model
can be identified with the so-called susceptible/infective/recovered (SIR) model for
the spread of disease in a network [Gruhl et al. 2004].

As we see, what we address in this paper is a problem that is converse to the
influence maximization problem. The problem is to minimize the spread of
undesirable things by blocking a limited number of links in a network. More specifically,
when some undesirable thing starts at any node and diffuses through
the network under the IC model, we consider finding a set of K links such that the resulting
network obtained by blocking those links minimizes the contamination degree for the
undesirable thing, where K is a given positive integer. We refer to this
combinatorial optimization problem as a contamination minimization problem. We introduce
two definitions for the contamination degree of a network: the average
contamination degree and the worst contamination degree. According to these definitions,
we formalize two contamination minimization problems: the average contamination
minimization problem and the worst contamination minimization problem. The
former aims to minimize the expected number of contaminated nodes (i.e., the
average case), and the latter aims to minimize the maximum number of contaminated
nodes (i.e., the worst case).

In [Kimura et al. 2008], we presented a method for efficiently finding a good
approximate solution to the average contamination minimization problem on the
basis of a natural greedy strategy. In this paper, we explain that method in
more detail, and propose a novel method for efficiently finding a good
approximate solution to the worst contamination minimization problem on the
basis of the same greedy strategy.

Furthermore,  for  both  the  average  and  the  worst  contamination  minimization 
problems,  we  compare  the  proposed  methods  with  a  naive  greedy  strategy  in  terms 
of  computational  complexity,  and  show  that  the  proposed  methods  can  achieve 
a  great  deal  of  reduction  in  computational  cost.  We  also  present  strategies  for 
making  the  proposed  methods  computationally  more  efficient  in  practice.  Finally, 
using  large  real  networks  that  exhibit  many  of  the  key  features  of  social  networks, 
we  experimentally  demonstrate  that  the  proposed  methods  outperform  link-removal 
heuristics  that  rely  on  the  well-studied  notions  of  betweenness  and  out-degree  in 
the  field  of  complex  network  theory.  In  particular,  we  show  that  unlike  the  case 
of  blocking  a  limited  number  of  nodes,  the  strategy  of  removing  nodes  with  high 
out-degrees  is  not  necessarily  effective  for  our  problems. 

2.  INFORMATION  DIFFUSION  MODEL

We  assume  the  IC  model  to  be  a  mathematical  model  for  the  diffusion  process 
of  some  undesirable  thing  on  a  network.  We  call  nodes  active  if  they  have  been 
contaminated  by  the  undesirable  thing. 

Let G = (V, E) be a directed network, where V and E (⊂ V × V) stand for
the sets of all the nodes and (directed) links, respectively. Throughout this
paper, a network means a directed network, a link means a directed link, and
we also call a network a graph. Following Kempe et al. [2003], we define
the IC model on graph G, and recall a mathematical definition of the influence
maximization problem for the IC model on graph G.


2.1  Independent  Cascade  Model 

First, we define the IC model on graph G. In the IC model, the diffusion
process unfolds in discrete time-steps t ≥ 0, and it is assumed that nodes can
switch their states only from inactive to active, but not from active to
inactive. Given an initial set A of active nodes, we assume that the nodes in A
first become active at time-step 0, and all the other nodes are inactive at
time-step 0. For every e ∈ E, we specify a real value p_e with 0 < p_e < 1 in
advance. Here, p_e is referred to as the propagation probability through link e.

The diffusion process proceeds from a given initial active set A in the
following way. When a node u first becomes active at time-step t, it is given a
single chance to activate each currently inactive child node w, and succeeds
with probability p_e, where e = (u, w) ∈ E. Here, for a link e' = (u', w') ∈ E,
node u' is called a parent node of w', and node w' is called a child node of
u'. If u succeeds, then w will become active at time-step t + 1. If multiple
parent nodes of w first become active at time-step t, then their activation
attempts are sequenced in an arbitrary order, but all performed at time-step t.
Whether or not u succeeds, it cannot make any further attempts to activate w in
subsequent rounds. The process terminates when no more activations are
possible.

For an initial active set A, let φ(A; G) denote the number of active nodes at
the end of the random process for the IC model on G. Note that φ(A; G) is a
random variable. Let σ(A; G) denote the expected value of φ(A; G). We call
σ(A; G) the influence degree of node set A on graph G. In particular, when A
is a set consisting of a single node, A = {v}, we simply denote σ(A; G) by
σ(v; G), and call σ(v; G) the influence degree of node v on graph G.
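
The IC process and the influence degree defined above can be illustrated with a direct Monte Carlo simulation. The following Python code is only a minimal sketch under our own assumptions: the names simulate_ic and influence_degree, the representation of links as (u, w) tuples, and the dictionary prob of propagation probabilities are hypothetical and do not come from the original method description.

```python
import random
from collections import defaultdict


def simulate_ic(links, prob, seeds, rng):
    """Run one IC diffusion and return the set of nodes active at the end."""
    children = defaultdict(list)
    for u, w in links:
        children[u].append(w)
    active = set(seeds)
    frontier = list(seeds)  # nodes that first became active at the current step
    while frontier:
        next_frontier = []
        for u in frontier:
            for w in children[u]:
                # u has a single chance to activate each currently inactive child
                if w not in active and rng.random() < prob[(u, w)]:
                    active.add(w)
                    next_frontier.append(w)
        frontier = next_frontier
    return active


def influence_degree(links, prob, seeds, runs=1000, seed=0):
    """Monte Carlo estimate of sigma(A; G), the expected final number of active nodes."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(links, prob, seeds, rng)) for _ in range(runs))
    return total / runs
```

For example, influence_degree(links, prob, ["a"]) would estimate σ(a; G) for a node labeled "a". Simulating the process in this way is simple but slow, which is one motivation for the bond percolation method discussed later.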


2.2  Influence  Maximization  Problem 

Next, we recall a mathematical definition of the influence maximization
problem on a network. Here, we consider maximizing the spread of desirable
information through graph G = (V, E). Let K be a given positive integer with
K < |V|. Here, |X| stands for the number of elements of a set X. The influence
maximization problem on G for the IC model is defined as follows: Find a subset
A* of V with |A*| = K such that σ(A*; G) ≥ σ(A; G) for every A ⊂ V with
|A| = K.

3.  PROBLEM  FORMULATION

We assume that some undesirable thing starts at an arbitrary node in a network
and diffuses through the network under the IC model. To prevent it from
spreading through the network, we aim to minimize the contamination degree for
the undesirable thing by appropriately removing a fixed number of links. Here,
the contamination degree of a network is a measure of how badly the undesirable
thing will contaminate the network. We give two definitions for contamination
degree, and mathematically formalize two contamination minimization problems on
a network.

3.1  Contamination  Degree 

For  any  graph  G  =  (V,E),  we  introduce  two  definitions  for  contamination  degree 
of  G. 

3.1.1  Average Contamination Degree.  We define the average contamination
degree c_0(G) of graph G as the average of influence degrees of all the nodes
in G,

    c_0(G) = \frac{1}{|V|} \sum_{v \in V} \sigma(v; G).    (1)

3.1.2  Worst Contamination Degree.  We define the worst contamination degree
c_+(G) of graph G as the maximum of influence degrees of all the nodes in G,

    c_+(G) = \max_{v \in V} \sigma(v; G).    (2)

3.2  Contamination  Minimization  Problem 

According  to  the  above  definitions  of  contamination  degree,  we  mathematically 
define  the  contamination  minimization  problems  on  a  network,  which  are  converse 
to  the  influence  maximization  problem  on  the  network. 

For any graph G = (V, E), we denote by c(G) either the average contamination
degree c_0(G) or the worst contamination degree c_+(G). For any link e ∈ E, let
G(e) denote the graph (V, E \ {e}). We refer to G(e) as the graph constructed
by blocking e in G. Similarly, for any D ⊂ E, let G(D) denote the graph
(V, E \ D). We refer to G(D) as the graph constructed by blocking D in G.

We define the contamination minimization problems on a graph G = (V, E) as
follows: Given a positive integer K with K < |E|, find a subset D* of E with
|D*| = K such that c(G(D*)) ≤ c(G(D)) for any D ⊂ E with |D| = K. The
contamination minimization problem for c = c_0 is referred to as the average
contamination minimization problem, and the contamination minimization problem
for c = c_+ is referred to as the worst contamination minimization problem.

For a large network, any straightforward method for exactly solving the
contamination minimization problems suffers from combinatorial explosion.
Therefore, we consider solving the problems approximately.

4.  PROPOSED  METHOD 

We propose methods for efficiently finding good approximate solutions to our
contamination minimization problems. Let K be the number of links to be blocked
in the problems.

4.1  Greedy  Algorithm

We  approximately  solve  the  contamination  minimization  problems  on  a  given  graph 
G0  =  (V0,E0)  by  the  following  greedy  algorithm: 

A1.  Initialize a subset D of E_0 as D ← ∅.

A2.  Initialize a graph G = (V, E) as V ← V_0 and E ← E_0.

A3.  Choose a link e* ∈ E minimizing c(G(e)), (e ∈ E).

A4.  Update D as D ← D ∪ {e*}.

A5.  Update G = (V, E) as E ← E \ {e*}.

A6.  Return to Step A3 if |D| < K.

A7.  Set D_K ← D.

A8.  Set G_K ← G.

Here, D_K is the set of blocked links, and represents the approximate solution
obtained by this algorithm. We refer to D_K as the greedy solution. G_K is the
graph constructed by blocking D_K in the graph G_0, that is, G_K = G_0(D_K).

To implement this greedy algorithm, we need methods for calculating

    e^* = \arg\min_{e \in E} c(G(e))    (3)

for a given graph G = (V, E) in Step A3 of the algorithm. The IC model is a
stochastic process model, and how to calculate influence degrees exactly by an
efficient method is an open question [Kempe et al. 2003]. Therefore, we must
develop methods for efficiently estimating {c(G(e)); e ∈ E} for graph
G = (V, E).

Kimura et al. [2007] presented the bond percolation method, which efficiently
estimates the influence degrees {σ(v; G); v ∈ V} for any graph G = (V, E).
Thus, in the greedy algorithm, we can estimate c(G(e)) for each e ∈ E by
applying the bond percolation method to the graph G(e) and using Equation (1)
or (2). Namely, we can simply estimate the greedy solution D_K by implementing
Step A3 of the greedy algorithm as follows:

(1)  Estimate {c(G(e)); e ∈ E} by straightforwardly performing the bond
percolation method |E| times.

(2)  Find e* ∈ E such that c(G(e*)) ≤ c(G(e)) for any e ∈ E.

We refer to this strategy as the naive greedy strategy. However, for a large
network, |E| remains very large throughout the greedy algorithm unless K is
very large. Namely, the naive greedy strategy is not practical for large
networks. Therefore, we propose more efficient methods for estimating e* ∈ E
satisfying Equation (3) on the basis of the bond percolation method.
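
To make the cost of the naive greedy strategy concrete, Steps A1 to A8 can be sketched as follows. This Python code is an illustration under our own assumptions, not the implementation used in the experiments: the function names are hypothetical, and worst_degree is a deterministic stand-in that simply counts reachable nodes (as if every link always propagated) in place of a bond percolation estimate of c(G(e)).

```python
from collections import defaultdict


def reach_size(v, links):
    """Number of nodes reachable from v (v included) along the given links."""
    children = defaultdict(list)
    for u, w in links:
        children[u].append(w)
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for w in children[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen)


def worst_degree(nodes, links):
    """Stand-in for c_+(G): the largest reachable-set size over all nodes."""
    return max(reach_size(v, links) for v in nodes)


def greedy_block(nodes, links, k, contamination):
    """Steps A1-A8: repeatedly block the link e minimizing c(G(e))."""
    remaining = list(links)  # E
    blocked = []             # D
    while len(blocked) < k and remaining:
        # Step A3, naive version: one full evaluation of c(G(e)) per link e
        best = min(
            remaining,
            key=lambda e: contamination(nodes, [x for x in remaining if x != e]),
        )
        blocked.append(best)    # Step A4
        remaining.remove(best)  # Step A5
    return blocked, remaining
```

Each pass through the loop evaluates the contamination degree once per remaining link, which is exactly the |E| evaluations that make the naive strategy impractical for large networks.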

4.2  Bond  Percolation  Method 

First, we revisit the bond percolation method [Kimura et al. 2007]. Here, we
consider estimating the influence degrees {σ(v; G); v ∈ V} for the IC model
with propagation probabilities {p_e; e ∈ E} on a graph G = (V, E).

The bond percolation process with occupation probabilities {p_e; e ∈ E} on
graph G is the random process in which each link e ∈ E is independently
declared “occupied” with probability p_e. Note that in terms of information
diffusion on a network, the occupied links represent the links through which
the information propagates, and the unoccupied links represent the links
through which the information does not propagate. For a positive integer M, we
perform the bond percolation process M times, and sample a set of M graphs
constructed by the occupied links,

    \{ G_m = (V, E_m);\ m = 1, \ldots, M \}.

For any v ∈ V, we define s(v; G, M) by

    s(v; G, M) = \frac{1}{M} \sum_{m=1}^{M} |F(v; G_m)|.    (4)

Here, for any graph G = (V, E) and any node v ∈ V, F(v; G) stands for the set
of all the nodes that are reachable from node v on graph G. We say that node u
is reachable from node v on graph G if there is a path from v to u along the
links on graph G.

It is known [Newman 2003] that the IC model with propagation probabilities
{p_e; e ∈ E} on graph G can be exactly mapped onto the bond percolation process
with occupation probabilities {p_e; e ∈ E} on graph G, and the influence degree
σ(v; G) of node v ∈ V can be well approximated by s(v; G, M),

    \sigma(v; G) \simeq s(v; G, M), \quad (v \in V),    (5)

if M is sufficiently large. We decompose each graph G_m into its strongly
connected components (SCCs) as follows:

    V = \bigcup_{j=1}^{J_m} SCC(u_j^m; G_m),    (6)

where J_m is the number of the strongly connected components of graph G_m, each
u_j^m is an element of V, and SCC(u_j^m; G_m) denotes the SCC of graph G_m that
contains node u_j^m. Note that

    |F(v; G_m)| = |F(u_j^m; G_m)|, \quad \text{if } v \in SCC(u_j^m; G_m).    (7)

Thus, by calculating {|F(u_j^m; G_m)|; j = 1, ..., J_m} in advance and using
Equation (7), we efficiently calculate |F(v; G_m)| for all v ∈ V. Once we have
{|F(v; G_m)|; v ∈ V, m = 1, ..., M}, we can calculate s(v; G, M) for all v ∈ V
from Equation (4).

Namely, the bond percolation method estimates all the influence degrees
{σ(v; G); v ∈ V} on graph G as follows: It first specifies the value of integer
M, calculates s(v; G, M) for all v ∈ V by performing the above procedure, and
estimates σ(v; G) for all v ∈ V by using Equation (5).
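
The procedure above can be sketched in Python as follows. This is a hedged illustration rather than the implementation of Kimura et al. [2007]: the function names are hypothetical, the SCC decomposition uses Kosaraju's algorithm as one possible choice, and the reachable set of each SCC is found by a plain traversal from one representative node, as suggested by Equation (7).

```python
import random
from collections import defaultdict


def scc_representatives(nodes, links):
    """Map every node to a representative of its SCC (Kosaraju, iterative)."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, w in links:
        fwd[u].append(w)
        rev[w].append(u)
    order, seen = [], set()
    for s in nodes:  # first pass: record nodes in order of DFS completion
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(fwd[s]))]
        while stack:
            u, it = stack[-1]
            advanced = False
            for w in it:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(fwd[w])))
                    advanced = True
                    break
            if not advanced:
                order.append(u)
                stack.pop()
    rep = {}
    for s in reversed(order):  # second pass: traverse the reversed graph
        if s in rep:
            continue
        rep[s] = s
        stack = [s]
        while stack:
            u = stack.pop()
            for w in rev[u]:
                if w not in rep:
                    rep[w] = s
                    stack.append(w)
    return rep


def reach_sizes(nodes, links):
    """|F(v; G)| for all v, using one traversal per SCC (Equation (7))."""
    rep = scc_representatives(nodes, links)
    fwd = defaultdict(list)
    for u, w in links:
        fwd[u].append(w)
    size_of = {}
    for r in set(rep.values()):
        seen, stack = {r}, [r]
        while stack:
            u = stack.pop()
            for w in fwd[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        size_of[r] = len(seen)
    return {v: size_of[rep[v]] for v in nodes}


def bond_percolation(nodes, links, prob, M, seed=0):
    """Estimate s(v; G, M) of Equation (4) for every node v."""
    rng = random.Random(seed)
    total = dict.fromkeys(nodes, 0)
    for _ in range(M):
        occupied = [e for e in links if rng.random() < prob[e]]  # E_m
        for v, size in reach_sizes(nodes, occupied).items():
            total[v] += size
    return {v: total[v] / M for v in nodes}
```

The saving comes from reach_sizes: nodes in the same SCC share one traversal, so a sample with a few large components needs far fewer traversals than |V|.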

4.3  Estimation  Method 

Now, we give methods for efficiently estimating e* ∈ E satisfying Equation (3)
for a given graph G = (V, E) to implement Step A3 of the greedy algorithm for
the average and the worst contamination minimization problems.

First, we perform the bond percolation process M times on graph G = (V, E),
and sample a set of M graphs constructed by the occupied links,

    \{ G_m = (V, E_m);\ m = 1, \ldots, M \},

where M is a given positive integer. Next, we calculate

    B_M(e) = \{ m \in \{1, \ldots, M\};\ e \notin E_m \}, \quad (e \in E).    (8)

Note that B_M(e) represents the subset of the M trials of the bond percolation
process on graph G in which e is not an occupied link.

Here, we consider performing the bond percolation process |B_M(e)| times on the
graph G(e) = (V, E \ {e}) for any e ∈ E, and sampling a set of |B_M(e)| graphs
constructed by the occupied links,

    \{ G(e)_m;\ m = 1, \ldots, |B_M(e)| \}.

We assume that M is large enough so that |B_M(e)| is also sufficiently large.
Then, by Equation (5), we have

    \sigma(v; G(e)) \simeq s(v; G(e), |B_M(e)|), \quad (v \in V).    (9)

Note from Equation (4) that

    s(v; G(e), |B_M(e)|) = \frac{1}{|B_M(e)|} \sum_{m=1}^{|B_M(e)|} |F(v; G(e)_m)|, \quad (v \in V).    (10)

In order to efficiently estimate {c(G(e)); e ∈ E} without applying the bond
percolation method to the graph G(e) for every e ∈ E, we alternatively
calculate

    s_M(v, e) = \frac{1}{|B_M(e)|} \sum_{m \in B_M(e)} |F(v; G_m)|, \quad (v \in V,\ e \in E),    (11)

for the graph G on the basis of the bond percolation method. Since each link of
graph G is independently declared “occupied” in the bond percolation process,
we can obtain the following theorem from Equations (8), (9), (10) and (11).


Theorem 4.1.  Let G = (V, E) be a graph. For every v ∈ V and e ∈ E, we have

    s_M(v, e) \to \sigma(v; G(e))

as M → ∞.


From Theorem 4.1, we can apply the approximation

    \sigma(v; G(e)) \simeq s_M(v, e), \quad (v \in V,\ e \in E),    (12)

for a sufficiently large M. Therefore, by Equations (1) and (2), we propose
estimating e* ∈ E satisfying Equation (3) for a given graph G = (V, E) as
follows:

    e^* = \arg\min_{e \in E} \left( \frac{1}{|V|} \sum_{v \in V} s_M(v, e) \right)    (13)

for the average contamination minimization problem (i.e., c = c_0), and

    e^* = \arg\min_{e \in E} \left( \max_{v \in V} s_M(v, e) \right)    (14)

for the worst contamination minimization problem (i.e., c = c_+). Notice that
for the proposed method, the value of M is specified in advance.
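
The key idea of this section is that a single set of M percolation samples on G can be reused for every link e. The following Python sketch illustrates this under our own assumptions: the names estimate_s_M and choose_link are hypothetical, reachable-set sizes are computed by a plain traversal per node for brevity (the SCC-based speed-up of Section 4.2 is omitted), and the table of all s_M(v, e) values is stored explicitly, which would be too large for real networks.

```python
import random
from collections import defaultdict


def reach_size(v, links):
    """Number of nodes reachable from v (v included) along the given links."""
    children = defaultdict(list)
    for u, w in links:
        children[u].append(w)
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for w in children[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen)


def estimate_s_M(nodes, links, prob, M, seed=0):
    """Return {e: {v: s_M(v, e)}} computed from M shared samples (Equation (11))."""
    rng = random.Random(seed)
    count = dict.fromkeys(links, 0)                      # |B_M(e)|
    total = {e: dict.fromkeys(nodes, 0) for e in links}  # sums over m in B_M(e)
    for _ in range(M):
        occupied = [e for e in links if rng.random() < prob[e]]  # E_m
        sizes = {v: reach_size(v, occupied) for v in nodes}      # |F(v; G_m)|
        occupied_set = set(occupied)
        for e in links:
            if e not in occupied_set:  # trial m belongs to B_M(e), Equation (8)
                count[e] += 1
                for v in nodes:
                    total[e][v] += sizes[v]
    return {
        e: {v: total[e][v] / count[e] for v in nodes}
        for e in links
        if count[e] > 0
    }


def choose_link(s_M, worst=False):
    """Equation (14) if worst is True, otherwise Equation (13)."""
    if worst:
        score = lambda row: max(row.values())
    else:
        score = lambda row: sum(row.values()) / len(row)
    return min(s_M, key=lambda e: score(s_M[e]))
```

Note that for the worst case the maximum is taken over the averaged values s_M(v, e), not over the per-sample maxima, in accordance with Equation (14).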


ACM  Journal  Name,  Vol.  V,  No.  N,  Month  20YY. 


4.4  Computational  Complexity  and  Implementational  Strategy 

For both the average and the worst contamination minimization problems, we
compare the proposed methods with the naive greedy strategy in terms of
computational complexity. We focus on the computational complexity of
estimating e* ∈ E satisfying Equation (3) for a given graph G = (V, E).

Let Q be the expected computational complexity for calculating the values of
{s(v; G, 1); v ∈ V} on graph G = (V, E) on the basis of the bond percolation
method (see Equation (4)). Then, the expected computational complexity of the
proposed method for calculating {s_M(v, e); v ∈ V, e ∈ E} amounts to MQ, since
the values of {|F(v; G_m)|; v ∈ V, m = 1, ..., M} are calculated on the basis
of the bond percolation method (see Equations (4) and (11)). Note that for any
e ∈ E, calculating {s_M(v, e); v ∈ V} for the proposed methods corresponds to
estimating c(G(e)) through |B_M(e)| trials of the bond percolation process on
graph G(e) (see Equation (11)). For the naive greedy strategy, we consider
estimating c(G(e)) through |B_M(e)| trials of the bond percolation process on
graph G(e) (see Equations (9) and (10)). Then, in order to estimate the values
of {c(G(e)); e ∈ E}, the naive greedy strategy requires the computational
complexity of Q Σ_{e∈E} |B_M(e)|. Here we assumed that the computational
complexities of s(v; G, 1) and s(v; G(e), 1) are the same because |E| is
sufficiently large in general. By noting that the expected value of |B_M(e)|
is (1 − p_e)M, the expected computational complexity of the naive greedy
strategy for estimating {c(G(e)); e ∈ E} becomes MQ Σ_{e∈E} (1 − p_e). Thus,
we can see that the proposed methods are Σ_{e∈E} (1 − p_e) times faster than
the naive greedy strategy on average. For instance, when the number of links
is 100,000 and each propagation probability p_e for the IC model is a uniform
probability p = 0.2, the value of Σ_{e∈E} (1 − p_e) is 80,000. Namely, the
proposed methods can achieve a large reduction in computational cost compared
with the naive greedy strategy.
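
The speed-up factor can be written out explicitly as the ratio of the two expected costs, followed by the numerical example with |E| = 100,000 and p_e = 0.2 for all e. This is only a restatement of the argument above in a single display:

```latex
\frac{Q \sum_{e \in E} \mathrm{E}\bigl[\,|B_M(e)|\,\bigr]}{MQ}
  = \frac{MQ \sum_{e \in E} (1 - p_e)}{MQ}
  = \sum_{e \in E} (1 - p_e),
\qquad
\sum_{e \in E} (1 - 0.2) = 100{,}000 \times 0.8 = 80{,}000 .
```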

Furthermore, the following strategies can be used in practice to efficiently
find e* ∈ E satisfying Equation (13) or (14) for a given graph G = (V, E).

First, for the worst contamination minimization problem, we apply the idea of
lazy evaluations for marginal increments of a submodular function by Leskovec
et al. [2007]. More specifically, we efficiently calculate Equation (14) by
appropriately pruning the evaluations of {s_M(v, e); v ∈ V, e ∈ E}. By
Equations (4) and (11), we have

    M \, s(v; G, M) = |B_M(e)| \, s_M(v, e) + \sum_{m \notin B_M(e)} |F(v; G_m)|

for any v ∈ V and e ∈ E. Thus, we can derive the following upper bound with
respect to s_M(v, e):

    \frac{M}{|B_M(e)|} \, s(v; G, M) \ge s_M(v, e), \quad (v \in V,\ e \in E).    (15)

We arbitrarily fix a link e ∈ E. Then, we first sort all the nodes {v ∈ V}
of graph G by the value M s(v; G, M)/|B_M(e)| in descending order as follows:
(v_i; i = 1, ..., |V|). We next calculate the value of s_M(v, e) in this order,
until the current maximum value s_M(v*, e) exceeds the value
M s(v_{i+1}; G, M)/|B_M(e)| for the head of the remaining nodes. By
Equation (15), this pruning guarantees that the current maximum value attains
the maximum without necessarily evaluating s_M(v, e) for all v ∈ V. In our
experiments, the computational efficiency was greatly improved by using this
strategy, just as reported in [Leskovec et al. 2007].
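
The pruning rule can be sketched as follows for one fixed link e. This Python code is illustrative only: the name pruned_max and its arguments are hypothetical, s_full[v] stands for s(v; G, M), b_size for |B_M(e)|, and s_M_of is assumed to be a callable that evaluates s_M(v, e) on demand. The sketch stops as soon as the current maximum reaches the next bound, which is safe because Equation (15) gives an upper bound.

```python
def pruned_max(nodes, s_full, M, b_size, s_M_of):
    """Return (max_v s_M(v, e), number of evaluations) using the bound of Eq. (15)."""
    bound = {v: M * s_full[v] / b_size for v in nodes}
    order = sorted(nodes, key=bound.get, reverse=True)  # descending upper bounds
    best = float("-inf")
    evaluated = 0
    for v in order:
        if best >= bound[v]:
            # no remaining node can exceed the current maximum
            break
        best = max(best, s_M_of(v))
        evaluated += 1
    return best, evaluated
```

Because the bound is proportional to s(v; G, M), the sorted order of nodes is the same for every link e, so the sort could be performed once and shared across links.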

Next, for the average contamination minimization problem, we efficiently
calculate Equation (13) without evaluating the value of s_M(v, e) for every
pair of node v and link e. Our strategy is to exploit the relation

    \frac{1}{|V|} \sum_{v \in V} s_M(v, e)
      = \frac{1}{|B_M(e)|} \sum_{m \in B_M(e)} \frac{1}{|V|} \sum_{v \in V} |F(v; G_m)|, \quad (e \in E),    (16)

(see Equation (11)). More specifically, we evaluate
Σ_{v∈V} |F(v; G_m)| / |V| for each m on the basis of the bond percolation
method in advance (see Equations (6) and (7)), and then calculate
Equation (13) by evaluating Σ_{v∈V} s_M(v, e) for every e ∈ E using
Equation (16).


5.  EXPERIMENTAL  EVALUATION 

Using two large real networks that exhibit many of the key features of social
networks, we experimentally evaluated the performance of the proposed method.

5.1  Network  Data 

First, we employed a trackback network of blogs because a piece of information
can propagate from one blog author to another blog author through a trackback.
Since bloggers (i.e., blog authors) discuss various topics and establish mutual
communications by putting trackbacks on each other’s blogs, we regarded a link
created by a trackback as a bidirectional link. By tracing up to ten steps back
in the trackbacks from the blog of the theme “JR Fukuchiyama Line Derailment
Collision” in the site “goo” (footnote 1), we collected a large connected
trackback network in May 2005. The resulting network was a directed graph of
12,047 nodes and 79,920 links, which features the so-called “power-law” degree
distribution that most large real networks exhibit (see Figure 1). Here, the
degree distribution is the distribution of the number of undirected links for
every node. We refer to this network data as the blog network.

Next, we employed a network of people that was derived from the “list of
people” within Japanese Wikipedia. Specifically, we extracted the maximal
connected component of the undirected graph obtained by linking two people in
the “list of people” if they co-occur in six or more Wikipedia pages, and
constructed a directed graph by regarding those undirected links as
bidirectional ones. We refer to this network data as the Wikipedia network.
Here, the total numbers of nodes and directed links were 9,481 and 245,044,
respectively. The network also showed the power-law degree distribution (see
Figure 2).

1  http://blog.goo.ne.jp/usertheme/

Fig. 1.  The degree distribution for the blog network.

Fig. 2.  The degree distribution for the Wikipedia network.

Fig. 3.  The degree correlation for the Wikipedia network.

Newman and Park [2003] observed that social networks represented as undirected
graphs generally have the following two statistical properties that distinguish
them from non-social networks. First, they show positive correlations between
the degrees of adjacent nodes. Second, they have much higher values of the
clustering coefficient CC than the corresponding configuration models (i.e.,
random network models). Here, the clustering coefficient CC for an undirected
graph is defined by

    CC = \frac{3 \times \text{number of triangles on the graph}}{\text{number of connected triples of nodes}},

where a “triangle” means a set of three nodes each of which is connected to
the other two, and a “connected triple” means a node connected directly to an
unordered pair of other nodes. For the undirected graph of the Wikipedia
network, the value of CC of the corresponding configuration model was 0.046,
while the actual measured value of CC was 0.39. Namely, the undirected graph of
the Wikipedia network had a much higher value of the clustering coefficient
than the corresponding configuration model. Moreover, we can see from Figure 3
that the Wikipedia network had a weakly positive degree correlation. Therefore,
we believe that the Wikipedia network is a typical example of a large real
social network represented by an undirected graph, and can be used as the
network data to evaluate the performance of the proposed method.
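
For concreteness, the clustering coefficient defined above can be computed as in the following Python sketch. The function name and the edge-list input format are our own assumptions. The code counts connected triples by their center node and counts how many of them are closed; each triangle closes exactly three triples, which corresponds to the factor 3 in the definition.

```python
from itertools import combinations


def clustering_coefficient(edges):
    """CC = 3 * (number of triangles) / (number of connected triples)."""
    adjacency = {}
    for u, v in edges:
        if u != v:  # ignore self-loops
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)
    triples = 0  # a center node together with an unordered pair of its neighbors
    closed = 0   # triples whose two end nodes are themselves linked
    for neighbors in adjacency.values():
        for a, b in combinations(neighbors, 2):
            triples += 1
            if b in adjacency[a]:
                closed += 1
    return closed / triples if triples else 0.0
```

A single triangle gives CC = 1, while a path of three nodes gives CC = 0.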

5.2  Experimental  Settings 

For the bond percolation method, we need to specify the number M of times the
bond percolation process is performed. It is reported [Kimura et al. 2007]
that setting the value of M at several thousand is good enough for estimating
influence degrees for the blog and Wikipedia networks. The value of M used in
the experiments in this paper was assessed as follows. We estimated the average
and the worst contamination degrees for the two networks with M = 8,000 and
M = 300,000, where we assigned a uniform probability p to each propagation
probability p_e for the IC model (how the value of p is determined for each
network is described in detail in the next paragraph). The difference in the
estimated average contamination degree for M = 8,000 and M = 300,000 was about
0.01% for the blog network and 0.02% for the Wikipedia network. Also, the
corresponding difference in the estimated worst contamination degree was about
0.02% for the blog network and 0.01% for the Wikipedia network. Thus, we
concluded that the estimated contamination degrees for these networks with
M = 8,000 are comparable to those with M = 300,000. By considering the assigned
values of the propagation probabilities, we decided to use M = 10,000
throughout the experiments.

Fig. 4.  Fragmentation of the blog network for the IC model. The fraction H of
the maximal SCC as a function of the propagation probability p.

Fig. 5.  Fragmentation of the Wikipedia network for the IC model. The fraction
H of the maximal SCC as a function of the propagation probability p. The upper
and lower frames show the network fragmentation curves for the whole range of p
and the range of 0.01 < p < 0.09, respectively.

Because we assigned a uniform probability p to the propagation probability p_e
for any directed link e of a network, the IC model had a single parameter p,
and we determined the typical value of p for each of the blog and Wikipedia
networks, and used them in the experiments. Let us consider the bond
percolation process corresponding to the IC model with propagation probability
p on a graph G = (V, E). Let H be the expected fraction of the maximal SCC in
the network constructed by occupied links. H is a function of p, and as the
value of p decreases, the value of H decreases. In other words, as the value of
p decreases, the original graph G gradually fragments into small clusters under
the corresponding bond percolation process. Figures 4 and 5 show the network
fragmentation curves for the blog and Wikipedia networks, respectively. Note
that H → 1 as p → 1 since the
blog and Wikipedia networks are strongly connected. Here, given the value of p,
we estimated H as follows (see Equation (6)):

    H = \frac{1}{M} \sum_{m=1}^{M} \max_{1 \le j \le J_m} \frac{|SCC(u_j^m; G_m)|}{|V|},

where M = 10,000. We focus on the point p* at which the average rate dH/dp of
change of H attains the maximum, and regard it as the typical value of p for
the network. Note that p* is a critical point of dH/dp, and defines one of the
features intrinsic to the network. From Figures 4 and 5, we estimated p* to be
p* = 0.2 for the blog network and p* = 0.03 for the Wikipedia network.


5.3  Comparison  Methods 

We compared the proposed method with three other heuristic methods. Two of
them are based on the well-studied notions of betweenness and out-degree in the
field of complex network theory, and the other one is the crude baseline of
blocking links randomly. We refer to these methods as the betweenness method,
the out-degree method and the random method, respectively.


5.3.1  Betweenness Method.  The betweenness score b_G(e) of a link e in a graph
G = (V, E) is defined as follows:

    b_G(e) = \sum_{u, v \in V} \frac{n_G(e; u, v)}{N_G(u, v)},

where N_G(u, v) denotes the number of the shortest paths from node u to node
v on graph G, and n_G(e; u, v) denotes the number of those paths that pass e.
Here, we set n_G(e; u, v)/N_G(u, v) = 0 if N_G(u, v) = 0. Newman and Girvan
[2004] successfully extracted community structure in a network using the
following link-removal algorithm based on betweenness:

B1.  Calculate betweenness scores for all links in the network.

B2.  Find the link with the highest score and remove it from the network.

B3.  Recalculate betweenness scores for all remaining links.

B4.  Repeat from Step B2.


In  particular,  the  notion  of  betweenness  can  be  interpreted  in  terms  of  signals 
traveling  through  a  network.  If  signals  travel  from  source  nodes  to  destination 
nodes  along  the  shortest  paths  in  a  network,  and  all  nodes  send  signals  at  the  same 
constant  rate  to  all  others,  then  the  betweenness  score  of  a  link  is  a  measure  of  the 
rate  at  which  signals  pass  along  the  link.  Thus,  we  naively  expect  that  blocking  the 
links  with  the  highest  betweenness  score  can  be  effective  for  preventing  the  spread 
of  contamination  in  the  network.  Therefore,  we  apply  the  method  of  Newman  and 
Girvan  [2004]  to  the  contamination  minimization  problems. 
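
The betweenness score and Steps B1 to B4 can be sketched in Python as follows. This is an illustration under our own assumptions, not the code used in the experiments: the function names are hypothetical, links are assumed to be distinct (u, w) tuples, and the scores are accumulated with a breadth-first, Brandes-style dependency computation, which is one standard way to evaluate the defining sum for unweighted graphs.

```python
from collections import defaultdict, deque


def link_betweenness(nodes, links):
    """b_G(e) for every link e: sum over (u, v) of n_G(e; u, v) / N_G(u, v)."""
    children = defaultdict(list)
    for u, w in links:
        children[u].append(w)
    score = dict.fromkeys(links, 0.0)
    for s in nodes:
        dist = {s: 0}
        n_paths = defaultdict(int)  # number of shortest paths from s
        n_paths[s] = 1
        preds = defaultdict(list)   # predecessors on shortest paths
        order = []                  # nodes in non-decreasing distance from s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in children[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    n_paths[w] += n_paths[u]
                    preds[w].append(u)
        dependency = defaultdict(float)
        for w in reversed(order):   # accumulate from the farthest nodes back to s
            for u in preds[w]:
                share = n_paths[u] / n_paths[w] * (1.0 + dependency[w])
                score[(u, w)] += share
                dependency[u] += share
    return score


def betweenness_blocking(nodes, links, k):
    """Steps B1-B4, stopped after k links have been removed."""
    remaining, blocked = list(links), []
    while len(blocked) < k and remaining:
        score = link_betweenness(nodes, remaining)  # B1 / B3
        top = max(remaining, key=score.get)         # B2
        blocked.append(top)
        remaining.remove(top)
    return blocked
```

Pairs (u, v) with no path from u to v contribute nothing to the scores, in line with the convention n_G(e; u, v)/N_G(u, v) = 0 when N_G(u, v) = 0.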


5.3.2  Out-degree Methods.  Previous work has shown that simply removing nodes
in order of decreasing out-degrees works well for preventing the spread of
contamination in most real networks [Albert et al. 2000; Broder et al. 2000;
Callaway et al. 2000; Newman et al. 2002]. Here, the out-degree d(v) of a node
v means the number of outgoing links from the node v. Therefore, as a
comparison method, we consider the straightforward application of this node
removal method. Namely, we employ the method of choosing nodes in decreasing
order of out-degree and simultaneously blocking all the links attached to the
chosen nodes. We refer to this method as the node out-degree method. Note that,
because it blocks all the links attached to a chosen node at once, the node
out-degree method cannot be applied for all values of positive integer K
(< |E|) to the contamination minimization problems of blocking K links.

We also consider the method of blocking links between nodes with high
out-degrees as an alternative comparison method. We define the link out-degree d(e)
of a link e = (u, v) from node u to node v by

d(e)  =  d(u)d(v), 

and  recursively  block  links  in  decreasing  order  of  link  out-degree.  We  refer  to  this 
method  as  the  link  out-degree  method. 
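
The link out-degree method can be illustrated with a short sketch. The function name is hypothetical, and we read "recursively" as recomputing the out-degrees on the remaining network after each blocked link, which is an assumption; ties are broken by the input order of the links.

```python
from collections import Counter


def block_by_link_out_degree(links, k):
    """Block k links in decreasing order of d(e) = d(u) * d(v)."""
    remaining = list(links)
    blocked = []
    for _ in range(min(k, len(remaining))):
        # Out-degrees are recomputed on the current network (an assumption).
        out_deg = Counter(u for u, _ in remaining)
        best = max(remaining, key=lambda e: out_deg[e[0]] * out_deg[e[1]])
        remaining.remove(best)
        blocked.append(best)
    return blocked
```

Unlike the node out-degree method, this procedure blocks one link at a time and can therefore be run for any number K of links.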

5.4  Experimental  Results 

We evaluated the performance of the proposed method and compared it with that of
the betweenness, the node out-degree, the link out-degree and the random methods.
The performance is evaluated by the average contamination degree $c_0$
and the worst contamination degree $c_+$. We estimated these values by using the
bond percolation method with M = 10,000, that is,

$c_0(G_K) = \frac{1}{|V|} \sum_{v \in V} s(v; G_K, M),$

$c_+(G_K) = \max_{v \in V} s(v; G_K, M)$

(see Equation (4)). Note that this evaluation is done separately
from the approximation used to search for the link to be deleted, i.e., Equation (11).
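
For illustration, the two quantities can be estimated by a naive Monte Carlo version of bond percolation. This sketch is not the estimator of Equation (4): it assumes a uniform diffusion probability p, counts the source node itself as contaminated, and recomputes reachability separately from every node in every trial, which is only practical for small networks. All names are hypothetical.

```python
import random


def contamination_degrees(nodes, links, p, m, seed=0):
    """Monte Carlo estimates of the average and worst contamination degree."""
    rng = random.Random(seed)
    total = {v: 0 for v in nodes}
    for _ in range(m):
        # One bond percolation trial: each link is kept with probability p.
        succ = {v: [] for v in nodes}
        for u, v in links:
            if rng.random() < p:
                succ[u].append(v)
        for s in nodes:
            # Nodes reachable from s over the kept links, s included.
            seen = {s}
            stack = [s]
            while stack:
                u = stack.pop()
                for v in succ[u]:
                    if v not in seen:
                        seen.add(v)
                        stack.append(v)
            total[s] += len(seen)
    s_hat = {v: total[v] / m for v in nodes}
    c0 = sum(s_hat.values()) / len(nodes)
    c_plus = max(s_hat.values())
    return c0, c_plus
```

A single sampled set of kept links is shared by all source nodes within a trial, which is the property that makes bond percolation attractive compared with simulating the diffusion from each node independently.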

5.4.1 Average Contamination Minimization Problem. Figures 6 and 7 show the
average contamination degree $c_0$ as a function of the number K of links blocked
for the blog network, and Figures 8 and 9 show the corresponding results for the
Wikipedia  network.  In  these  figures  the  circles,  squares,  diamonds,  triangles  and 
crosses  indicate  the  results  for  the  proposed,  the  betweenness,  the  node  out-degree, 
the  link  out-degree  and  the  random  methods,  respectively.  For  each  dataset,  there 
are  two  figures,  one  comparing  the  proposed  method  with  the  betweenness  method 
and  the  other  comparing  the  proposed  method  at  a  fixed  value  of  K  =  500  with 
the  node  out-degree,  the  link  out-degree  and  the  random  methods. 

First, note that the average contamination degree $c_0$ at K = 0 is 976 for the blog
network and 403 for the Wikipedia network, which is 8.2% and 4.2%, respectively.
The average contamination degree as defined by Equation (1) is less than 10%. The
fact that this value for the Wikipedia network is about half of that for the blog network
is explained by the smaller value of p for the Wikipedia network, once the difference
in network sizes is taken into account. As expected, the proposed method performs the best
and the betweenness method follows. The other three methods are much worse
than these two in the networks used.


Fig. 6. Performance comparison between the proposed and the betweenness methods in the blog
network for the average contamination minimization problem.


Fig.  7.  Performance  comparison  of  the  proposed  method  at  K  =  500  with  the  node  out-degree, 
the  link  out-degree  and  the  random  methods  in  the  blog  network  for  the  average  contamination 
minimization  problem. 


The number of links blocked, K = 500, corresponds to 0.63% of the total links for
the blog network and 0.2% for the Wikipedia network. Conversely, 0.2% of the total
links corresponds to 163 links for the blog network. The average contamination
degree at 0.2% link block, i.e., K = 163 for the blog network and K = 500 for the
Wikipedia network, is 495 and 243 for the proposed method, which is equivalent
to a 49% and 40% reduction in the degree, respectively, and 607 and 306 for the
betweenness method, which is equivalent to a 38% and 24% reduction in the degree,
respectively. The difference between the two methods is 11% for the blog network
and 16% for the Wikipedia network. The average contamination
degree at 0.63% link block for the blog network, i.e., K = 500, is 267 for the proposed
method and 303 for the betweenness method, which is equivalent to a 73% and 69%
reduction in the degree, respectively, and the difference between the two methods
is 4%.


Fig. 8. Performance comparison between the proposed and the betweenness methods in the
Wikipedia network for the average contamination minimization problem.


Fig. 9. Performance comparison of the proposed method at K = 500 with the node out-degree, the
link out-degree and the random methods in the Wikipedia network for the average contamination
minimization problem.


Fig. 10. Performance comparison between the proposed and betweenness methods in the blog
network for the worst contamination minimization problem.


In contrast to these modest differences, both the proposed method and the betweenness
method outperform by far the other three methods (the node out-degree, the link
out-degree and the random) for both the blog and the Wikipedia networks. Blocking
500 links by the proposed method is equivalent to blocking more than 10,000 links
for the blog network and 20,000 links for the Wikipedia network by the other three
methods, meaning that the proposed method is 20 to 40 times more effective.
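
The reduction percentages quoted in this subsection follow directly from the reported average contamination degrees. The short sketch below reproduces that arithmetic; the dictionary layout and the function name are ours, and the numbers are those given in the text.

```python
def reduction_percent(before, after):
    """Relative reduction of the contamination degree, in percent."""
    return 100.0 * (1.0 - after / before)


# (network, K, method) -> (c_0 at K = 0, c_0 after blocking K links)
reported = {
    ("blog", 163, "proposed"): (976, 495),
    ("blog", 163, "betweenness"): (976, 607),
    ("Wikipedia", 500, "proposed"): (403, 243),
    ("Wikipedia", 500, "betweenness"): (403, 306),
    ("blog", 500, "proposed"): (976, 267),
    ("blog", 500, "betweenness"): (976, 303),
}

for key, (before, after) in reported.items():
    print(key, round(reduction_percent(before, after)))
```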

5.4.2 Worst Contamination Minimization Problem. Figures 10 and 11 show the
worst contamination degree $c_+$ as a function of the number K of links blocked for
the blog network, and Figures 12 and 13 show the corresponding results for the
Wikipedia network. The meaning of the symbols and the layout of the
figures are the same as before.

First, note that the worst contamination degree $c_+$ at K = 0 is 3218 for the blog
network and 1929 for the Wikipedia network, which is 27% and 20%, respectively.
They are about 3 and 5 times larger than the average contamination degrees. The
difference of the values between the two networks is consistent with the average
contamination case. The overall performance difference among the methods is
also consistent with the average contamination case.

The worst contamination degree at 0.2% link block, i.e., K = 163 for the blog
network and K = 500 for the Wikipedia network, is 1763 and 1177 for the proposed method,
which is equivalent to a 45% and 39% reduction in the degree, respectively, and
2455 and 1700 for the betweenness method, which is equivalent to a 24% and 12%
reduction in the degree, respectively. The difference between the two methods is
21% for the blog network and 27% for the Wikipedia network. The
worst contamination degree at 0.63% link block for the blog network, i.e., K = 500,
is 1045 for the proposed method and 1193 for the betweenness method, which is
equivalent to a 78% and 63% reduction in the degree, respectively, and the difference
between the two methods is 15%.


Fig. 11. Performance comparison of the proposed method for K = 500 with the node out-degree,
link out-degree and random methods in the blog network for the worst contamination minimization
problem.


Fig. 12. Performance comparison between the proposed and betweenness methods in the
Wikipedia network for the worst contamination minimization problem.

Again, in contrast to these differences, both the proposed method and the betweenness
method outperform by far the other three methods (the node out-degree, the link
out-degree and the random) for both the blog and the Wikipedia networks. Blocking
500 links by the proposed method is equivalent to blocking more than 10,000 links
for the blog network and 30,000 links for the Wikipedia network by the other three
methods, meaning that the proposed method is 20 to 60 times more effective.


Fig. 13. Performance comparison of the proposed method for K = 500 with the node out-degree,
link out-degree and random methods in the Wikipedia network for the worst contamination
minimization problem.

5.4.3 Discussion. These results imply that the proposed method works effectively
as expected, and outperforms the conventional link-removal heuristics. There
is no big difference in the comparative performance results between the two networks.
For both of them, the betweenness method performs reasonably well but
the other three methods (the node out-degree, the link out-degree and the random)
perform very poorly. No out-degree myth is observed: blocking by out-degree is
not effective in these networks.

Of course, how each of the conventional link-removal heuristics performs depends on the
characteristics of the network structure. In general, a network consists of multiple
communities; the members of each community are tightly connected, and the
members of different communities are less tightly connected. Thus, it is reasonable
to assume that blocking the links between different communities is effective
in preventing the contaminant from diffusing from one community to others. This is
particularly true when there is a small number of nodes that play a key role in
connecting different communities. Blocking this small number of paths is quite
effective. The fact that the betweenness method performed reasonably well implies
that the networks we analyzed may have this type of community structure. On
the other hand, if the network is hierarchically structured, blocking the nodes
in the upper hierarchy, or equivalently blocking the links attached to them, should be
quite effective. The fact that the node out-degree method does not do well suggests
that there may not be such a structure in the networks we analyzed. Among
the three poorly performing methods, the link out-degree method performs most
poorly. It performs worse than the random method for the blog network. This
would indicate that it is mainly blocking the links within the communities.

With all these different factors affecting the performance of each method taken into account,
the proposed method exhibits its strength of explicitly minimizing the contamination
by considering the dynamics of the information diffusion process, thereby making
its performance less sensitive to the structure of the network.

Considering the fact that all the methods can eventually block the contamination
when all of the links are blocked, it is important to have a method which is effective
when the number of links to be blocked is small, and the proposed
method has this property. It is noteworthy that blocking only 0.2% of the links by
the proposed method can reduce the contamination by nearly 50%.

We have devised two measures: the average contamination degree and the worst
contamination degree. It is expected that the performance difference between the
proposed method and the betweenness method is larger for the latter than the
former, and the results are consistent with this expectation. Our formulation does
not assume the origins of contamination to be known and fixed. If they are known
in advance, the problem is much easier computationally.


6.  CONCLUSION 

Just as good things, e.g., innovation and important topics, spread through a network
and bring positive effects to people, undesirable things, e.g., computer viruses and
malicious rumors, also spread and affect people badly. We addressed the problem
of minimizing the spread of undesirable things by blocking links in a social
network, which is the converse of the influence maximization problem for the same
network. In particular, we have considered two contamination minimization problems,
one minimizing the average contamination degree and the other minimizing
the worst (maximum) contamination degree. We chose to block “links” rather than
“nodes” because deleting nodes necessitates deleting links, but not vice versa.

We have proposed novel methods for efficiently finding good approximate solutions
to these problems on the basis of a natural greedy algorithm and the bond
percolation method. Using large-scale blog and Wikipedia networks, we have
experimentally demonstrated that the proposed method works effectively, and also
outperforms the conventional link-removal heuristics. The betweenness method
performed reasonably well but the out-degree methods performed very poorly,
almost as badly as the random method. No out-degree myth was observed for the
networks we analyzed. The performance of the link-removal heuristics is strongly
affected by the network structure, but the proposed method shows that it is
important to explicitly minimize the contamination by considering the dynamics of
the information diffusion process, which would make the performance less sensitive to
the structure of the network.


Abstract We address the problem of ranking influential nodes in complex social
networks by estimating diffusion probabilities from observed information diffusion
data using the popular independent cascade (IC) model. For this purpose we
formulate the likelihood for information diffusion data, which is a set of time sequence
data of active nodes, and propose an iterative method to search for the probabilities
that maximize this likelihood. We apply this to two real world social networks in
the simplest setting where the probability is uniform for all the links, and show that
the accuracy of the estimated probability is outstandingly good, and further show that the
proposed method can predict the high ranked influential nodes much more accurately
than four well-studied conventional heuristic methods.


1  Introduction 

Innovation, hot topics and even malicious rumors can propagate through social
networks among people in the form of so-called “word-of-mouth” communications.
The rise of the Internet and the World Wide Web accelerates the creation of
various large-scale social networks. Therefore, considerable attention has recently been
devoted to social networks as an important medium for the spread of information.

Previous work addressed the problem of tracking the propagation patterns of
topics or influence through blogspace [1, 5, 10], and studied strategies for removing
nodes to prevent the spread of some undesirable information through a network, for
example, the spread of a computer virus through an email network [2, 11]. A widely
used fundamental probabilistic model of information diffusion through a network is
the independent cascade (IC) model [6, 5]. Using this model, the problem of finding
a limited number of nodes that are effective for the spread of information [6, 8] has
been extensively investigated. This combinatorial optimization problem is called the
influence maximization problem. This problem was also investigated in a different
setting (a descriptive probabilistic model of interaction) [4, 13]. Further, yet another
problem of minimizing the spread of undesirable information by blocking a limited
number of links in a network [9] has recently been addressed. In this paper, we also
explore information diffusion phenomena for the IC model in a given network.


Overall, finding influential nodes in a social network is one of the most central
problems in the field of social network analysis. There exist several methods for
ranking nodes on the basis of the network structure [15]. We also address this
problem, but from a different angle. We propose a method for extracting influential nodes
by ranking nodes in terms of influence degrees for the IC model on the basis of the
observed data of information diffusion in the network. The IC model is equipped
with parameters. More specifically, the diffusion probability must be specified for
each link in the network in advance. We estimate the probabilities so that the
likelihood of obtaining the observed set of information diffusion data is maximized by
an iterative algorithm (EM algorithm). Using two real world networks, the blog and
Wikipedia networks, we first evaluate the accuracy of the diffusion probabilities and
then use the estimated model to find the influential nodes. We compare the results
with the ground truth as well as with the results obtained by using four strategies,
each with a different heuristic, and show that the proposed method far outperforms
the conventional methods.

The  rest  of  the  paper  is  organized  as  follows.  The  proposed  method  is  formulated 
as  a  machine  learning  problem  in  section  2,  and  the  experimental  results  together 
with  the  experimental  settings  are  given  in  section  3,  followed  by  some  discussion 
of  how  the  probabilities  affect  the  influential  nodes  in  section  4.  We  conclude  this 
paper  by  summarizing  our  findings  in  section  5. 


2  Proposed  Method 

2.1  Problem  Formulation  and  Extraction  Method 

For a given directed network (or equivalently graph) $G = (V, E)$, let V be a set of
nodes (or vertices) and E a set of links (or edges), where we denote each link by
$e = (v, w) \in E$ with $v \neq w$, meaning there exists a directed link from a node v to a
node w. For each node v in the network G, we denote by F(v) the set of child nodes
of v, i.e., $F(v) = \{w; (v, w) \in E\}$. Similarly, we denote by B(v) the set of parent
nodes of v, i.e., $B(v) = \{u; (u, v) \in E\}$.

In the IC model, for each directed link e = (v, w), we specify a real value $p_{v,w}$
with $0 < p_{v,w} < 1$ in advance. Here $p_{v,w}$ is referred to as the diffusion probability of
link (v, w). The diffusion process proceeds from a given initial active set D(0) in the
following way. When a node v first becomes active at time-step t, it is given a single
chance to activate each currently inactive child node w, and succeeds with
probability $p_{v,w}$. If v succeeds, then w will become active at time-step t + 1. If multiple
parent nodes of w first become active at time-step t, then their activation attempts
are sequenced in an arbitrary order, but all performed at time-step t. Whether or not
v succeeds, it cannot make any further attempts to activate w in subsequent rounds.
The process terminates if no more activations are possible.
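
The diffusion process just described can be simulated in a few lines. The sketch below is an illustration only: the function name and the data layout (a dictionary of child lists and a dictionary of per-link probabilities) are assumptions of ours, not part of the original formulation.

```python
import random


def simulate_ic(children, prob, seeds, rng=None):
    """Run one IC diffusion; returns [D(0), D(1), ..., D(T)] as a list of sets."""
    rng = rng or random.Random()
    active = set(seeds)
    newly_active = set(seeds)
    history = [set(seeds)]
    while newly_active:
        next_active = set()
        for v in newly_active:
            for w in children.get(v, ()):
                # v gets a single chance on each child that is still inactive.
                if w not in active and w not in next_active:
                    if rng.random() < prob[(v, w)]:
                        next_active.add(w)
        if not next_active:
            break  # no more activations are possible
        active |= next_active
        history.append(next_active)
        newly_active = next_active
    return history
```

Skipping a child that another parent has already activated in the same time-step does not change the outcome, since the order of simultaneous attempts is arbitrary. The returned list has the same form as the observed diffusion results D(0), ..., D(T) used for estimation.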

For a given set of diffusion probabilities, $\theta = \{p_{v,w}; (v, w) \in E\}$, and an initial
active node v, we define the influence degree, denoted by $\sigma(v; \theta)$, as the expected
number of active nodes. Our problem of finding influential nodes is formulated as a
node ranking problem based on the influence degree $\sigma(v; \theta_0)$, where $\theta_0$ means the set
of the true diffusion probabilities. In practical settings, however, the true diffusion
probability set $\theta_0$ is not available. Thus, we consider using the probabilities
$\hat{\theta}$ estimated from past information diffusion histories observed as sets of active
nodes. Then we need to evaluate the ranking similarity between the two node
lists sorted according to $\sigma(v; \theta_0)$ and $\sigma(v; \hat{\theta})$.


2.2  Probability  Estimation  Method 

Let $D = D(0) \cup D(1) \cup \cdots \cup D(T)$ be an information diffusion result, where D(t) is
the set of nodes that have become active at time t. When $v \in D(t)$ and $w \in D(t+1) \cap F(v)$
hold for some link e = (v, w), it is possible that the node v succeeded in
activating the node w via the link e. However, since we should consider the possibility
that some other nodes $v' \in D(t) \cap B(w)$ also succeeded in activating the node w, we
need to calculate the probability that the node w becomes active at time t + 1 as
follows:

$P(w; t+1) = 1 - \prod_{v \in B(w) \cap D(t)} (1 - p_{v,w}).$

Here note that if $w \in D(t+1)$, it is guaranteed that $D(t) \cap B(w) \neq \emptyset$.

We set $C(t) = D(0) \cup \cdots \cup D(t)$. Note that C(t) is the set of active nodes at time
t. When $v \in D(t)$ and $w \in F(v) \setminus C(t+1)$ hold, we know that the node v definitely
failed to activate the node w via the link e. Clearly, when $v \in D(t)$ and $w \in F(v) \cap C(t)$
hold, as well as when $v \notin D$, no information is available about the trial with respect
to the link e = (v, w). Therefore, we can define the likelihood function with respect
to $\theta$ as follows:

$\mathcal{L}(\theta; D) = \left( \prod_{t=0}^{T-1} \prod_{w \in D(t+1)} \left( 1 - \prod_{v \in B(w) \cap D(t)} (1 - p_{v,w}) \right) \right) \left( \prod_{t=0}^{T-1} \prod_{v \in D(t)} \prod_{w \in F(v) \setminus C(t+1)} (1 - p_{v,w}) \right).$

Let $\{D_m; 1 \le m \le M\}$ be an observed data set of M independent information
diffusion results. Then we can define the following objective function with respect
to $\theta$:

$J(\theta) = \sum_{m=1}^{M} \log \mathcal{L}(\theta; D_m).$    (1)

Thus, our problem is to obtain the set of information diffusion probabilities $\theta$
which maximizes Equation (1). For this estimation problem, we have already
proposed an estimation method based on the Expectation-Maximization algorithm in
order to stably obtain its solutions [14].

In order to evaluate fundamental abilities of our method, in this paper, we
consider the simplest case in which all links have the same diffusion probability p. Note
that this problem setting has been widely adopted in many previous experiments
[6, 8, 9], and the formulation is valid for more general cases in which there is no
such restriction.
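
To make the uniform-probability case concrete, the following sketch derives an EM-style update directly from the likelihood and is not taken from [14]. With a single probability p, an activation of w with a active parents contributes the factor 1 - (1 - p)^a, and every certain failure contributes 1 - p. The function names and the data layout (dictionaries of child lists and parent sets, episodes given as lists of sets D(0), ..., D(T)) are assumptions. Trials are counted for t = 0, ..., T - 1 only; appending an empty final set to an episode makes the failed attempts of the last active nodes count as well.

```python
def trial_counts(children, parents, episode):
    """Return (a_list, n_fail) for one episode [D(0), ..., D(T)].

    a_list holds the number of parents in D(t) for every w in D(t+1);
    n_fail counts links (v, w) with v in D(t) and w in F(v) but not in C(t+1).
    """
    a_list, n_fail = [], 0
    covered = set(episode[0])                    # C(t)
    for t in range(len(episode) - 1):
        covered_next = covered | episode[t + 1]  # C(t+1)
        for w in episode[t + 1]:
            a_list.append(len(parents.get(w, set()) & episode[t]))
        for v in episode[t]:
            n_fail += sum(1 for w in children.get(v, ()) if w not in covered_next)
        covered = covered_next
    return a_list, n_fail


def estimate_uniform_p(children, parents, episodes, p=0.5, n_iter=200):
    """EM iterations for a single diffusion probability shared by all links."""
    a_all, n_fail = [], 0
    for episode in episodes:
        a_list, fails = trial_counts(children, parents, episode)
        a_all += a_list
        n_fail += fails
    n_trials = sum(a_all) + n_fail
    if n_trials == 0:
        return p  # no informative trial was observed
    for _ in range(n_iter):
        # E-step: expected number of successful attempts behind each activation.
        expected_success = sum(a * p / (1.0 - (1.0 - p) ** a) for a in a_all)
        # M-step: expected successes divided by all attempts with a known outcome.
        p = expected_success / n_trials
    return p
```

When every activated node has exactly one active parent, the update reduces to the fraction of successful attempts, which is the usual maximum likelihood estimate of a Bernoulli parameter.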


3  Experiments 

3.1  Experimental  Settings 

We employed two sets of large real networks used in [9], the blog and Wikipedia
networks, which exhibit many of the key features of social networks. These are
bidirectional networks. The blog network had 12,047 nodes and 79,920 directed
links, and the Wikipedia network had 9,481 nodes and 245,044 directed links. As
stated before, in our preliminary experiments, we assumed the simplest case where
the diffusion probability is uniform throughout the network, and set the value p as
follows: p = 0.1 for the blog network and p = 0.01 for the Wikipedia network.
We evaluated the influence degrees $\{\sigma(v); v \in V\}$ using the method of [8] with
the parameter value 10,000, where the parameter represents the number of bond
percolation processes (we do not describe the method here due to the page limit).
The average value and the standard deviation of the influence degrees were 87.5 and
131 for the blog network, and 8.14 and 18.4 for the Wikipedia network.

In the learning stage, a training sample was an information diffusion path
$D = D(0) \cup D(1) \cup \cdots \cup D(T)$, which is a sequence of the active nodes starting from a
randomly selected initial active node. We used M training samples for learning the
propagation probability, where M is a parameter.


3.2  Comparison  Methods 

We  compared  the  proposed  method  with  four  heuristics  from  social  network  analysis 
with  respect  to  the  predictive  capability  of  high  ranked  influential  nodes. 

First, “degree centrality”, “closeness centrality”, and “betweenness centrality”
are commonly used as influence measures in sociology [15], where the degree of
node v is defined as the number of links attached to v, the closeness of node v is
defined as the reciprocal of the average distance between v and other nodes in the
network, and the betweenness of node v is defined as the total number of shortest
paths between pairs of nodes that pass through v.

We  also  consider  measuring  the  influence  of  each  node  by  its  “authoritativeness” 
obtained  by  the  “PageRank”  method  [3],  since  this  is  a  well  known  method  for 
identifying  authoritative  or  influential  pages  in  a  hyperlink  network  of  web  pages. 
This method has a parameter $\varepsilon$; when we view it as a model of a random web
surfer, $\varepsilon$ corresponds to the probability with which a surfer jumps to a page picked
uniformly at random [12]. In our experiments, we used a typical setting of $\varepsilon = 0.15$.
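
As an illustration of the random-surfer reading of the jump parameter, PageRank scores can be approximated by a simple power iteration. This is a generic sketch rather than the implementation used in the experiments; the function name is hypothetical, and the uniform redistribution of the score of nodes without outgoing links is an assumption.

```python
def pagerank(nodes, links, eps=0.15, n_iter=100):
    """Power iteration for PageRank with uniform jump probability eps."""
    nodes = list(nodes)
    n = len(nodes)
    succ = {v: [] for v in nodes}
    for u, v in links:
        succ[u].append(v)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(n_iter):
        # Score held by nodes without outgoing links is spread uniformly (assumption).
        dangling = sum(rank[v] for v in nodes if not succ[v])
        new_rank = {v: (eps + (1.0 - eps) * dangling) / n for v in nodes}
        for u in nodes:
            for v in succ[u]:
                # With probability 1 - eps the surfer follows a random outgoing link.
                new_rank[v] += (1.0 - eps) * rank[u] / len(succ[u])
        rank = new_rank
    return rank
```

Nodes are then ranked in decreasing order of the returned scores, in the same way as for the centrality measures above.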


3.3 Experimental Results

First, we examined the learning performance of propagation probability by the
proposed method. Let $p_0$ be the true value of propagation probability, and let $\hat{p}$ be the
value of propagation probability estimated by the proposed method. We evaluated
the learning performance in terms of the error rate $\mathcal{E} = |p_0 - \hat{p}| / p_0$.


Table 1 Learning performance of propagation probability.

  Results for the blog network      Results for the Wikipedia network
  M      $\mathcal{E}$              M      $\mathcal{E}$
  20     0.036 (0.024)              20     0.138 (0.081)
  40     0.018 (0.014)              40     0.109 (0.066)
  60     0.016 (0.007)              60     0.080 (0.041)
  80     0.009 (0.006)              80     0.047 (0.018)
  100    0.006 (0.004)              100    0.021 (0.013)

Table 1 shows the average value of $\mathcal{E}$, with the standard deviation in parentheses, for
each number of training samples M, where we performed the same experiment five
times independently. Our algorithm can converge to the true value efficiently when
there is a reasonable amount of training data. The results are better for a larger value
of diffusion probability. The results demonstrate the effectiveness of the proposed
method.


Fig.  1  Performance  comparison  in  extracting  influential  nodes  for  the  blog  network. 


Fig. 2 Performance comparison in extracting influential nodes for the Wikipedia network.


Next, in terms of ranking for extracting influential nodes from the network
G = (V, E), we compared the proposed method with the out-degree, the betweenness,
the closeness, and the PageRank methods. For any positive integer $r$ ($\le |V|$),
let $L_0(r)$ be the true set of top r nodes, and let $L(r)$ be the set of top r nodes for a
given ranking method. We evaluated the performance of the ranking method by the
ranking similarity F(r) at rank r, where F(r) is defined by $F(r) = |L_0(r) \cap L(r)| / r$.
We focused on ranking similarities at high ranks since we are interested in extracting
influential nodes. Figures 1 and 2 show the results for the blog and the Wikipedia
networks, respectively. Here, circles, triangles, diamonds, squares, and asterisks
indicate ranking similarity F(r) as a function of rank r for the proposed, the
out-degree, the betweenness, the closeness, and the PageRank methods, respectively.
For the proposed method, we plotted the average value of F(r) at r for five
experimental results in the case of M = 100. The proposed method gives far better
results than the other heuristic-based methods for both networks, demonstrating
the effectiveness of the proposed method.
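
The ranking similarity is the fraction of nodes shared by the two top-r lists, and can be computed as in the following sketch. The function name and the use of node-to-score dictionaries are our own choices; ties at the cut-off rank are broken by dictionary order, which is an arbitrary convention.

```python
def ranking_similarity(true_scores, method_scores, r):
    """F(r): size of the overlap of the two top-r node sets, divided by r."""
    def top(scores):
        return set(sorted(scores, key=scores.get, reverse=True)[:r])
    return len(top(true_scores) & top(method_scores)) / r
```

For example, if the true top two nodes are a and b while a method ranks a and c highest, then F(2) = 0.5.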


4  Discussion 

We consider that our proposed ranking method presents a novel concept of
centrality based on the information diffusion model, i.e., the IC model. Actually,
Figures 1 and 2 show that nodes identified as higher ranked by our method are
substantially different from those by each of the conventional methods. This means that
our method enables a new type of social network analysis if past information
diffusion data are available. Of course, it is beyond controversy that each conventional
method has its own merit and usage, and our method is an addition to them which
has a different merit in terms of information diffusion.

Fig. 3 An example of network.


Here, we give a simple analysis that explains why it is important to know
the diffusion probability in finding the influential nodes. If the probability did not
affect the ranking, we would not care about its absolute value. However, a simple
analysis reveals that it does affect the node ranking. Note that $\sigma(v; p)$ is a
monotonically increasing non-negative function of p if v's out-degree is non-zero. Assume
that there are two such nodes v and w that have the following graph structures:
$(v, v_1), (v, v_2), (v, v_3) \in E$ and $(w, w_1), (w, w_2), (w_1, w_3), (w_2, w_3) \in E$ (see Fig. 3). The
maximum influence degree is 3 for both nodes v and w. The expected values
are easily calculated [7] as $\sigma(v; p) = 3p$ and $\sigma(w; p) = 2p + (1 - (1 - p^2)^2) = 2p + 2p^2 - p^4$.
Thus, $\sigma(v; p) - \sigma(w; p) = p(1 - p)(1 - p - p^2)$. From this, if
$p < (-1 + \sqrt{5})/2$, then $\sigma(v; p) > \sigma(w; p)$. Otherwise, $\sigma(v; p) \le \sigma(w; p)$. Intuitively, as
p gets larger, the activation probability of the node reachable in two steps from the
starting node becomes larger than that of the nodes reachable in one step, and thus
w, which has a child node two steps downward, has a larger influence degree. Since
in general there are many subnetworks like these within a network, it is important
to estimate the diffusion probabilities as accurately as possible. We believe that the
methods proposed in this paper would contribute to various types of social network
analyses.
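
The crossover in this example can be checked numerically. The sketch below evaluates the two closed-form expressions given above on both sides of the threshold; the function names and the chosen values of p are ours.

```python
import math


def sigma_v(p):
    """Expected number of nodes activated by v: three independent children."""
    return 3 * p


def sigma_w(p):
    """Expected number activated by w: w1 and w2 directly, w3 through either of them."""
    return 2 * p + (1 - (1 - p * p) ** 2)


p_star = (math.sqrt(5) - 1) / 2  # positive root of 1 - p - p^2 = 0

for p in (0.1, 0.5, p_star, 0.7, 0.9):
    print(f"p={p:.3f}  sigma_v={sigma_v(p):.4f}  sigma_w={sigma_w(p):.4f}")
```

Below the threshold the node v with three direct children ranks higher; above it the node w overtakes v, so an inaccurate estimate of p can reverse the ranking of the two nodes.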

We note that the analysis we showed in this paper is the simplest case where
p takes a single value for all the links in E. However, the method is very general.
In a more realistic setting we can divide E into subsets $E_1, E_2, \ldots, E_N$ and assign
a different value $p_n$ to all the links in each $E_n$. For example, we may divide the
nodes into two groups: those that strongly influence others and those that do not, or we may
divide the nodes into another two groups: those that are easily influenced by others
and those that are not. We can further divide the nodes into multiple groups. If there is some
background knowledge about the node grouping, our method can make the best
use of it, which is one of the characteristics of the artificial intelligence approach. Obtaining
such background knowledge is also an important research topic in knowledge
discovery from social networks.


5  Conclusion 

We addressed the problem of ranking influential nodes in complex social networks, given the network topology and the observation data of information diffusion. We formulated the estimation of the diffusion probability of each link from past information diffusion histories, observed as sets of active nodes, as a likelihood maximization problem under the popular information diffusion model, the IC model, and derived an efficient iterative EM method to solve it. The results we obtained by applying the method to two real-world networks in the simplest setting, where the probability is uniform throughout each network, show that 1) the method can estimate the probability accurately when a sufficient number of observed sequences is available for training and 2) the ranking of influential nodes predicted by the method far outperforms those of the other well-known heuristic methods (degree centrality, closeness centrality, betweenness centrality, and authoritativeness).

Abstract

We address the problem of efficiently estimating the influence function of initially activated nodes in a social network under the susceptible/infected/susceptible (SIS) model, a diffusion model where nodes are allowed to be activated multiple times. The computational complexity drastically increases because of this multiple activation property. We solve this problem by constructing a layered graph from the original social network, with a layer added on top for each time-step, and applying bond percolation with a pruning strategy. We show by a theoretical analysis that the computational complexity of the proposed method is much smaller than that of the conventional naive probabilistic simulation method, and confirm this by applying the proposed method to two real-world networks.

1  Introduction 

Social  networks  mediate  the  spread  of  various  information 
including  topics,  ideas  and  even  (computer)  viruses.  The 
proliferation  of  emails,  blogs  and  social  networking  services 
(SNS)  in  the  World  Wide  Web  accelerates  the  creation  of 
large  social  networks.  Therefore,  substantial  attention  has 
recently  been  directed  to  investigating  information  diffusion 
phenomena  in  social  networks  [Adar  and  Adamic,  2005; 
Leskovec  et  al.,  2007b;  Agarwal  and  Liu,  2008]. 

Overall, finding influential nodes is one of the most central problems in social network analysis. Thus, developing methods to do this on the basis of information diffusion is an important research issue. Widely used fundamental probabilistic models of information diffusion are the independent cascade (IC) model and the linear threshold (LT) model [Kempe et al., 2003; Gruhl et al., 2004]. Researchers investigated the problem of finding a limited number of influential nodes that are effective for the spread of information under the above models [Kempe et al., 2003; Kimura et al., 2007]. This combinatorial optimization problem is called the influence maximization problem. Kempe et al. [2003] experimentally showed on large collaboration networks that the greedy algorithm can give a good approximate solution to this problem, and mathematically proved a performance guarantee of the greedy solution (i.e., the solution obtained by the greedy algorithm). Recently, methods based on bond percolation [Kimura et al., 2007] and submodularity [Leskovec et al., 2007a] were proposed for efficiently estimating the greedy solution. The influence maximization problem has applications in sociology and “viral marketing” [Agarwal and Liu, 2008], and was also investigated in a different setting (a descriptive probabilistic model of interaction) [Domingos and Richardson, 2001; Richardson and Domingos, 2002]. The problem has recently been extended to influence control problems such as a contamination minimization problem [Kimura et al., 2009].

The IC model can be identified with the so-called susceptible/infected/recovered (SIR) model for the spread of a disease [Newman, 2003; Gruhl et al., 2004]. In the SIR model, only infected individuals can infect susceptible individuals, while recovered individuals can neither infect nor be infected. This implies that an individual is never infected with the disease multiple times. This property holds true for the LT model as well. However, there exist phenomena for which the property does not hold. For example, consider the following propagation phenomenon of a topic in the blogosphere: A blogger who has not yet posted a message about the topic becomes interested in the topic by reading the blog of a friend, and posts a message about it (i.e., becomes infected). Next, the same blogger reads a new message about the topic posted by some other friend, and may post a message (i.e., become infected) again. Most simply, this phenomenon can be modeled by a susceptible/infected/susceptible (SIS) model from epidemiology. As in this example, there are many information diffusion phenomena for which the SIS model is more appropriate, including the growth of hyper-link posts among bloggers [Leskovec et al., 2007b], the spread of computer viruses without permanent virus-checking programs, and epidemic diseases such as tuberculosis and gonorrhea [Newman, 2003]. In this paper, we focus on an information diffusion process in a social network over a given time span on the basis of an SIS model.

Here, the SIS model is a stochastic process model, and the influence of a node $v$ at time-step $t$, $\sigma(v, t)$, is defined as the expected number of infected nodes at time-step $t$ when $v$ is initially infected at time-step $t = 0$. We refer to $\sigma$ as the influence function for the SIS model. Developing an effective method for estimating $\sigma$ is vital for various applications. Clearly, in order to extract influential nodes, we must estimate the value of $\sigma(v, t)$ for every node $v$ and time-step $t$. Moreover, note that the method developed can be easily extended and applied to approximately solving the influence maximization problem for the SIS model by the greedy algorithm. We can naively estimate $\sigma$ by simulating the SIS model. However, this naive method is highly inefficient and impractical, as shown in the experiments. In this paper, we propose a method for estimating the influence function $\sigma$ efficiently. By theoretically comparing its computational complexity with that of the naive method, we show that the proposed method is expected to achieve a large reduction in computational cost. Further, using two large real networks, we experimentally demonstrate that the proposed method is much more efficient than the naive method at the same accuracy.

2  Information  Diffusion  Model 

Let $G = (V, E)$ be a directed network, where $V$ and $E$ ($\subset V \times V$) stand for the sets of all the nodes and (directed) links, respectively. For any $v \in V$, let $\Gamma(v; G)$ denote the set of the child nodes (directed neighbors) of $v$, that is,

$\Gamma(v; G) = \{w \in V; (v, w) \in E\}$.

2.1  SIS  Model 

An SIS model for the spread of a disease is based on the cycle of disease in a host. A person is first susceptible to the disease, and becomes infected with some probability when the person encounters an infected person. The infected person soon becomes susceptible to the disease again without moving to the immune state. We consider a discrete-time SIS model for information diffusion on a network. In this context, infected nodes are nodes that have just adopted the information, and we call these infected nodes active nodes.

We define the SIS model for information diffusion on $G$. In the model, the diffusion process unfolds in discrete time-steps $t \ge 0$, and it is assumed that the state of a node is either active or inactive. For every link $(u, v) \in E$, we specify a real value $p_{u,v}$ with $0 < p_{u,v} < 1$ in advance. Here, $p_{u,v}$ is referred to as the propagation probability through link $(u, v)$. Given an initial set of active nodes $X$ and a time span $T$, the diffusion process proceeds in the following way. Suppose that node $u$ becomes active at time-step $t$ ($< T$). Then, node $u$ attempts to activate every $v \in \Gamma(u; G)$, and succeeds with probability $p_{u,v}$. If node $u$ succeeds, then node $v$ will become active at time-step $t + 1$. If multiple active nodes attempt to activate node $v$ in time-step $t$, then their activation attempts are sequenced in an arbitrary order. On the other hand, node $u$ will become inactive at time-step $t + 1$ unless it is activated by an active node in time-step $t$. The process terminates when the current time-step reaches the time limit $T$.
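One diffusion sample of the discrete-time SIS model described above can be sketched as follows. This is an illustrative sketch only: the adjacency-dictionary representation and the names `children`, `prob` and `simulate_sis` are assumptions made for the example, not notation from the paper.

```python
import random


def simulate_sis(children, prob, initial, T, rng):
    """Return [S_0, S_1, ..., S_T], the active sets of one SIS diffusion sample.

    children: dict mapping node u to an iterable of its child nodes
    prob:     dict mapping link (u, v) to its propagation probability
    initial:  set of initially active nodes (time-step 0)
    T:        time span
    rng:      a random.Random instance
    """
    active = set(initial)
    history = [set(active)]
    for _ in range(T):
        nxt = set()
        for u in active:
            # Each active node makes one independent attempt per child.
            for v in children.get(u, ()):
                if rng.random() < prob[(u, v)]:
                    nxt.add(v)
        # Nodes that were not activated in this step become inactive,
        # so the new active set simply replaces the old one.
        active = nxt
        history.append(set(active))
    return history


if __name__ == "__main__":
    children = {"a": ["b"], "b": ["a", "c"], "c": []}
    prob = {("a", "b"): 0.5, ("b", "a"): 0.5, ("b", "c"): 0.5}
    print(simulate_sis(children, prob, {"a"}, 5, random.Random(0)))
```

Because the active set is replaced at each step rather than accumulated, a node can be activated again at a later time-step, which is the property that distinguishes the SIS model from the IC model.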

2.2  Influence  Function 

For the SIS model on $G$, we consider a diffusion sample from an initial active node $v \in V$ over time span $T$. Let $S(v, t)$ denote the set of active nodes at time-step $t$. Note that $S(v, t)$ is a random subset of $V$ and $S(v, 0) = \{v\}$. Let $\sigma(v, t)$ denote the expected value of $|S(v, t)|$, where $|X|$ stands for the number of elements in a set $X$. We call $\sigma(v, t)$ the influence of node $v$ at time-step $t$. Note that $\sigma$ is a function defined on $V \times \{0, 1, \ldots, T\}$. We call the function $\sigma$ the influence function for the SIS model over time span $T$ on network $G$.

It is important to estimate the influence function $\sigma$ efficiently. We can simply estimate $\sigma$ by simulations based on the SIS model in the following way. First, a sufficiently large positive integer $M$ is specified. For each $v \in V$, the diffusion process of the SIS model is simulated from the initial active node $v$, and the number of active nodes at time-step $t$, $|S(v, t)|$, is calculated for every $t \in \{0, 1, \ldots, T\}$. Then, $\sigma(v, t)$ is estimated as the empirical mean of the values of $|S(v, t)|$ obtained from $M$ such simulations. We refer to this estimation method as the naive method. As shown in the experiments, the naive method is extremely inefficient and is not practical.
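The naive method can be sketched as a plain Monte Carlo loop over source nodes and simulations. The sketch below is a hedged illustration under an assumed adjacency-dictionary representation; `naive_influence` and its parameters are hypothetical names introduced for the example.

```python
import random


def naive_influence(children, prob, T, M, seed=0):
    """Estimate sigma(v, t) for every node v and t = 0..T by M simulations per node.

    children: dict mapping node u to a list of its child nodes
    prob:     dict mapping link (u, w) to its propagation probability
    Returns a dict: node -> list of T + 1 estimated influence values.
    """
    rng = random.Random(seed)
    nodes = set(children) | {w for ws in children.values() for w in ws}
    total = {v: [0.0] * (T + 1) for v in nodes}
    for v in nodes:
        for _ in range(M):
            active = {v}                      # S(v, 0) = {v}
            total[v][0] += 1
            for t in range(1, T + 1):
                # One independent coin flip per link leaving an active node.
                active = {w for u in active for w in children.get(u, ())
                          if rng.random() < prob[(u, w)]}
                total[v][t] += len(active)    # accumulate |S(v, t)|
    # Empirical mean over the M simulations.
    return {v: [x / M for x in row] for v, row in total.items()}


if __name__ == "__main__":
    children = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
    prob = {link: 0.3 for link in [("a", "b"), ("a", "c"), ("b", "a"), ("c", "a")]}
    print(naive_influence(children, prob, T=3, M=1000))
```

Note that every source node runs its own M independent simulations, so the same links are flipped again and again for different sources; this repeated work is what the BP method avoids.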

3  Proposed  Method 

We propose a method for efficiently estimating the influence function $\sigma$ over time span $T$ for the SIS model on network $G$.

3.1  Layered  Graph 

We build a layered graph $G^T = (V^T, E^T)$ from $G$ in the following way. First, for each node $v \in V$ and each time-step $t \in \{0, 1, \ldots, T\}$, we generate a copy $v_t$ of $v$ at time-step $t$. Let $V_t$ denote the set of copies of all $v \in V$ at time-step $t$. We define $V^T$ by $V^T = V_0 \cup V_1 \cup \cdots \cup V_T$. In particular, we identify $V$ with $V_0$. Next, for each link $(u, v) \in E$, we generate $T$ links $(u_{t-1}, v_t)$ ($t \in \{1, \ldots, T\}$) in the set of nodes $V^T$. We set $E_t = \{(u_{t-1}, v_t); (u, v) \in E\}$, and define $E^T$ by $E^T = E_1 \cup \cdots \cup E_T$. Moreover, for any link $(u_{t-1}, v_t)$ of the layered graph $G^T$, we define the occupation probability $q_{u_{t-1}, v_t}$ by $q_{u_{t-1}, v_t} = p_{u,v}$.
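The layered-graph construction can be written down directly. The following is a small sketch, assuming that a copy $v_t$ is represented as the pair `(v, t)`; the function name and data layout are illustrative choices, not part of the paper.

```python
def build_layered_graph(nodes, links, prob, T):
    """Build the layered graph G^T from G = (nodes, links).

    nodes: list of nodes of G
    links: list of directed links (u, v) of G
    prob:  dict mapping link (u, v) to propagation probability p_{u,v}
    Returns (V_T, q), where V_T lists the copies (v, t) for t = 0..T and
    q maps each layered link ((u, t-1), (v, t)) to its occupation probability.
    """
    V_T = [(v, t) for t in range(T + 1) for v in nodes]
    q = {}
    for t in range(1, T + 1):
        for (u, v) in links:
            # Each original link is copied once per time-step, from layer t-1 to layer t.
            q[((u, t - 1), (v, t))] = prob[(u, v)]
    return V_T, q


if __name__ == "__main__":
    nodes = ["a", "b"]
    links = [("a", "b"), ("b", "a")]
    prob = {("a", "b"): 0.2, ("b", "a"): 0.4}
    V_T, q = build_layered_graph(nodes, links, prob, T=3)
    print(len(V_T), "layered nodes,", len(q), "layered links")
```

Since every layered link goes from layer t-1 to layer t, the layered graph has (T + 1)|V| nodes and T|E| links and contains no directed cycles.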

Then, we can easily prove that the SIS model with propagation probabilities $\{p_e; e \in E\}$ on $G$ over time span $T$ is equivalent to the bond percolation process (BP) with occupation probabilities $\{q_e; e \in E^T\}$ on $G^T$.$^1$ Here, the BP process with occupation probabilities $\{q_e; e \in E^T\}$ on $G^T$ is the random process in which each link $e \in E^T$ is independently declared “occupied” with probability $q_e$. We perform the BP process on $G^T$, and generate a graph constructed by occupied links, $\tilde{G}^T = (V^T, \tilde{E}^T)$. Then, in terms of information diffusion by the SIS model on $G$, an occupied link $(u_{t-1}, v_t) \in E_t$ represents a link $(u, v) \in E$ through which the information propagates at time-step $t$, and an unoccupied link $(u_{t-1}, v_t) \in E_t$ represents a link $(u, v) \in E$ through which the information does not propagate at time-step $t$. For any $v \in V$, let $F(v; \tilde{G}^T)$ be the set of all nodes that can be reached from $v$ ($= v_0$) through a path on the graph $\tilde{G}^T$. When we consider a diffusion sample from an initial active node $v \in V$ for the SIS model on $G$, $F(v; \tilde{G}^T) \cap V_t$ represents the set of active nodes at time-step $t$, $S(v, t)$.

$^1$ The SIS model over time span $T$ on $G$ can be exactly mapped onto the IC model on $G^T$ [Kempe et al., 2003]. Thus, the result follows from the equivalence of the BP process and the IC model [Newman, 2003; Kempe et al., 2003; Kimura et al., 2007].


3.2  Bond  Percolation  Method 

Using the equivalent BP process, we present a method for efficiently estimating the influence function $\sigma$. We refer to this method as the BP method. Unlike the naive method, the BP method simultaneously estimates $\sigma(v, t)$ for all $v \in V$. Moreover, the BP method does not fully perform the BP process, but performs it partially. Note first that all the paths from a node $v \in V$ on the graph $\tilde{G}^T$ represent a diffusion sample from the initial active node $v$ for the SIS model on $G$. Let $L'$ be the set of the links in $G^T$ that are not in the diffusion sample. For calculating $|S(v, t)|$, it is unnecessary to determine whether the links in $L'$ are occupied or not. Therefore, the BP method performs the BP process for only an appropriate set of links in $G^T$. The BP method estimates $\sigma$ by the following algorithm:

BP method:

1. Set $\sigma(v, t) \leftarrow 0$ for each $v \in V$ and $t \in \{1, \ldots, T\}$.

2. Repeat the following procedure $M$ times:

2-1. Initialize $S(v, 0) = \{v\}$ for each $v \in V$, and set $A(0) \leftarrow V$, $A(1) \leftarrow \emptyset$, $\ldots$, $A(T) \leftarrow \emptyset$.

2-2. For $t = 1$ to $T$ do the following steps:

2-2a. Compute $B(t - 1) = \bigcup_{v \in A(t-1)} S(v, t - 1)$.

2-2b. Perform the BP process for the links from $B(t - 1)$ in $G^T$, and generate the graph $\tilde{G}_t$ constructed by the occupied links.

2-2c. For each $v \in A(t - 1)$, compute $S(v, t) = \bigcup_{w \in S(v, t-1)} \Gamma(w; \tilde{G}_t)$, and set $\sigma(v, t) \leftarrow \sigma(v, t) + |S(v, t)|$ and $A(t) \leftarrow A(t) \cup \{v\}$ if $S(v, t) \ne \emptyset$.

3. For each $v \in V$ and $t \in \{1, \ldots, T\}$, set $\sigma(v, t) \leftarrow \sigma(v, t)/M$, and output $\sigma(v, t)$.

Note that $A(t)$ finally becomes the set of information source nodes that have at least one active node at time-step $t$, that is, $A(t) = \{v \in V; S(v, t) \ne \emptyset\}$. Note also that $B(t - 1)$ is the set of nodes that are activated at time-step $t - 1$ by some source nodes, that is, $B(t - 1) = \bigcup_{v \in V} S(v, t - 1)$.
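The steps of the BP method can be sketched in Python as follows. This is a minimal illustration under assumed data structures (adjacency dictionaries, sets for $S(v, t)$), not the authors' implementation; `bp_influence` and its parameter names are hypothetical, and node labels are assumed to be sortable so that the coin flips are drawn in a fixed order.

```python
import random


def bp_influence(nodes, children, prob, T, M, seed=0):
    """Estimate sigma(v, t) for all v simultaneously with shared coin flips per time-step."""
    rng = random.Random(seed)
    total = {v: [0.0] * (T + 1) for v in nodes}           # step 1
    for _ in range(M):                                     # step 2
        S = {v: {v} for v in nodes}                        # 2-1: S(v, 0) = {v}
        A = list(nodes)                                    # A(0) = V
        for t in range(1, T + 1):                          # 2-2
            # 2-2a: nodes that are active for at least one source.
            B = set().union(*(S[v] for v in A))
            # 2-2b: flip each link leaving B exactly once; all sources share the result.
            occupied = {w: [x for x in children.get(w, ())
                            if rng.random() < prob[(w, x)]]
                        for w in sorted(B)}
            # 2-2c: follow the occupied links separately for every source.
            S_next, A_next = {}, []
            for v in A:
                reached = {x for w in S[v] for x in occupied[w]}
                if reached:
                    S_next[v] = reached
                    A_next.append(v)
                    total[v][t] += len(reached)
            S, A = S_next, A_next
    # Step 3: average over the M repetitions (sigma(v, 0) is 1 by definition).
    return {v: [1.0] + [x / M for x in row[1:]] for v, row in total.items()}


if __name__ == "__main__":
    nodes = ["a", "b", "c"]
    children = {"a": ["c"], "b": ["c"], "c": ["a", "b"]}
    prob = {("a", "c"): 0.5, ("b", "c"): 0.5, ("c", "a"): 0.5, ("c", "b"): 0.5}
    print(bp_influence(nodes, children, prob, T=3, M=2000))
```

The coin flips in step 2-2b are made once per link leaving $B(t-1)$ and reused for every source node, whereas the link-following in step 2-2c is still done per source.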

Now we estimate the computational complexity of the BP method in terms of the number of nodes, $M_a$, that are identified in step 2-2a, the number of coin-flips, $M_b$, for the BP process in step 2-2b, and the number of links, $M_c$, that are followed in step 2-2c. Let $d(v)$ be the number of out-links from node $v$ (i.e., the out-degree of $v$) and $d'(v)$ the average number of occupied out-links from node $v$ after the BP process. Here we can estimate $d'(v)$ by $\sum_{w \in \Gamma(v; G)} p_{v,w}$.

Then, for each time-step $t \in \{1, \ldots, T\}$, we have

$M_a = \sum_{v \in A(t-1)} |S(v, t - 1)|, \quad M_b = \sum_{w \in B(t-1)} d(w),$  (1)

and

$M_c = \sum_{v \in A(t-1)} \sum_{w \in S(v, t-1)} d'(w)$  (2)

on average.

In order to compare the computational complexity of the BP method to that of the naive method, we consider mapping the naive method onto the BP framework, that is, separating the coin-flip process and the link-following process. We can easily verify that the following algorithm in the BP framework is equivalent to the naive method:

A method that is equivalent to the naive method:

1. Set $\sigma(v, t) \leftarrow 0$ for each $v \in V$ and $t \in \{1, \ldots, T\}$.

2. Repeat the following procedure $M$ times:

2-1. Initialize $S(v, 0) = \{v\}$ for each $v \in V$, and set $A(0) \leftarrow V$, $A(1) \leftarrow \emptyset$, $\ldots$, $A(T) \leftarrow \emptyset$.

2-2. For $t = 1$ to $T$ do the following steps:

2-2b'. For each $v \in A(t - 1)$, perform the BP process for the links from $S(v, t - 1)$ in $G^T$, and generate the graph $\tilde{G}_t(v)$ constructed by the occupied links.

2-2c'. For each $v \in A(t - 1)$, compute $S(v, t) = \bigcup_{w \in S(v, t-1)} \Gamma(w; \tilde{G}_t(v))$, and set $\sigma(v, t) \leftarrow \sigma(v, t) + |S(v, t)|$ and $A(t) \leftarrow A(t) \cup \{v\}$ if $S(v, t) \ne \emptyset$.

3. For each $v \in V$ and $t \in \{1, \ldots, T\}$, set $\sigma(v, t) \leftarrow \sigma(v, t)/M$, and output $\sigma(v, t)$.

Then, for each $t \in \{1, \ldots, T\}$, the number of coin-flips, $M_{b'}$, in step 2-2b' is

$M_{b'} = \sum_{v \in A(t-1)} \sum_{w \in S(v, t-1)} d(w),$  (3)

and the number of links, $M_{c'}$, followed in step 2-2c' is equal to $M_c$ in the BP method on average. From equations (2) and (3), we can see that $M_{b'}$ is much larger than $M_{c'} = M_c$, especially for the case where the diffusion probabilities are small. By equations (1) and (3), we can also see that $M_{b'}$ is generally much larger than each of $M_a$ and $M_b$ in the BP method for a real social network. In fact, since such a network generally includes large clique-like subgraphs, there are many nodes $w \in V$ such that $d(w) \gg 1$, and we can expect that $\sum_{v \in A(t-1)} |S(v, t - 1)| \gg |\bigcup_{v \in A(t-1)} S(v, t - 1)|$ ($= |B(t - 1)|$). Therefore, the BP method is expected to achieve a large reduction in computational cost.
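The per-step cost counts in equations (1) to (3) can be evaluated for any given state. The helper below is an illustrative sketch; the function name and the dictionary-based inputs are assumptions for the example.

```python
def step_costs(A, S, out_degree, expected_occupied):
    """Evaluate M_a, M_b, M_b' and M_c for one time-step.

    A:                 iterable of source nodes in A(t-1)
    S:                 dict mapping source v to its active set S(v, t-1)
    out_degree:        dict mapping node w to d(w)
    expected_occupied: dict mapping node w to d'(w)
    """
    A = list(A)
    B = set().union(*(S[v] for v in A))                                # B(t-1)
    M_a = sum(len(S[v]) for v in A)                                    # eq. (1), step 2-2a
    M_b = sum(out_degree[w] for w in B)                                # eq. (1), step 2-2b
    M_b_naive = sum(out_degree[w] for v in A for w in S[v])            # eq. (3), step 2-2b'
    M_c = sum(expected_occupied[w] for v in A for w in S[v])           # eq. (2), step 2-2c
    return M_a, M_b, M_b_naive, M_c


if __name__ == "__main__":
    # Two sources whose active sets overlap completely.
    S = {1: {10, 11}, 2: {10, 11}}
    d = {10: 5, 11: 3}
    d_prime = {10: 0.5, 11: 0.3}
    print(step_costs([1, 2], S, d, d_prime))
```

When the active sets of different sources overlap heavily, $M_b$ counts each overlapping node once while $M_{b'}$ counts it once per source, which is where the saving in coin flips comes from.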

3.3  Pruning  Method 

In order to further improve the computational efficiency of the BP method, we introduce a pruning technique and propose a method referred to as the BP with pruning method. The key idea of the pruning technique is to utilize the following property: Once we have $S(u, t_0) = S(v, t_0)$ at some time-step $t_0$ in the course of the BP process for a pair of information source nodes, $u$ and $v$, then we have $S(u, t) = S(v, t)$ for all $t > t_0$. The BP with pruning method estimates $\sigma$ by the following algorithm:

BP with pruning method:

1. Set $\sigma(v, t) \leftarrow 0$ for each $v \in V$ and $t \in \{1, \ldots, T\}$.

2. Repeat the following procedure $M$ times:

2-1''. Initialize $S(v, 0) = \{v\}$ for each $v \in V$, and set $A(0) \leftarrow V$, $A(1) \leftarrow \emptyset$, $\ldots$, $A(T) \leftarrow \emptyset$, and $C(v) \leftarrow \{v\}$ for each $v \in V$.

2-2. For $t = 1$ to $T$ do the following steps:

2-2a. Compute $B(t - 1) = \bigcup_{v \in A(t-1)} S(v, t - 1)$.

2-2b. Perform the BP process for the links from $B(t - 1)$ in $G^T$, and generate the graph $\tilde{G}_t$ constructed by the occupied links.

2-2c''. For each $v \in A(t - 1)$, compute $S(v, t) = \bigcup_{w \in S(v, t-1)} \Gamma(w; \tilde{G}_t)$, set $A(t) \leftarrow A(t) \cup \{v\}$ if $S(v, t) \ne \emptyset$, and set $\sigma(u, t) \leftarrow \sigma(u, t) + |S(v, t)|$ for each $u \in C(v)$.

2-2d. Check whether $S(u, t) = S(v, t)$ for $u, v \in A(t)$, and set $C(v) \leftarrow C(v) \cup C(u)$ and $A(t) \leftarrow A(t) \setminus \{u\}$ if $S(u, t) = S(v, t)$.

3. For each $v \in V$ and $t \in \{1, \ldots, T\}$, set $\sigma(v, t) \leftarrow \sigma(v, t)/M$, and output $\sigma(v, t)$.

Basically,  by  introducing  step  2-2d  and  reducing  the  size  of 
A(t),  the  proposed  method  attempts  to  improve  the  computa¬ 
tional  efficiency  in  comparison  to  the  original  BP  method. 

For the proposed method, it is important to implement the equivalence check process in step 2-2d efficiently. In our implementation, we first classify each $v \in A(t)$ according to the value of $k = |S(v, t)|$, and then perform the equivalence check process only for those nodes with the same $k$ value. How effectively the proposed method works will depend on several conditions such as network structure, time span, values of diffusion probabilities, and so on. We will do a simple analysis later and experimentally show that it is indeed effective.
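The pruning idea can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the equivalence check of step 2-2d is done here by using each active set (a `frozenset`) as a dictionary key, which is a substitute for the size-based classification described above, and all names are hypothetical. Node labels are assumed to be sortable.

```python
import random


def bp_pruning_influence(nodes, children, prob, T, M, seed=0):
    """BP method with pruning: sources with identical active sets are merged."""
    rng = random.Random(seed)
    total = {v: [0.0] * (T + 1) for v in nodes}
    for _ in range(M):
        S = {v: frozenset([v]) for v in nodes}    # S(v, 0) = {v}
        C = {v: [v] for v in nodes}               # C(v): sources represented by v
        A = list(nodes)                           # A(0) = V
        for t in range(1, T + 1):
            B = set().union(*(S[v] for v in A))   # step 2-2a
            occupied = {w: [x for x in children.get(w, ())
                            if rng.random() < prob[(w, x)]]
                        for w in sorted(B)}       # step 2-2b
            rep = {}                              # active set -> representative source
            for v in A:                           # step 2-2c''
                reached = frozenset(x for w in S[v] for x in occupied[w])
                if not reached:
                    continue                      # v and all of C(v) contribute 0 from now on
                for u in C[v]:
                    total[u][t] += len(reached)
                # Step 2-2d: the first source seen with this active set represents it.
                r = rep.setdefault(reached, v)
                if r != v:
                    C[r].extend(C[v])             # merge v's class into r's class
            A = list(rep.values())
            S = {v: s for s, v in rep.items()}
    return {v: [1.0] + [x / M for x in row[1:]] for v, row in total.items()}


if __name__ == "__main__":
    nodes = ["a", "b", "c"]
    children = {"a": ["c"], "b": ["c"], "c": ["a", "b"]}
    prob = {("a", "c"): 0.5, ("b", "c"): 0.5, ("c", "a"): 0.5, ("c", "b"): 0.5}
    print(bp_pruning_influence(nodes, children, prob, T=3, M=2000))
```

Once two sources are merged, only the representative is propagated in later time-steps, and each step's count is credited to every source in its class, so the amount of link-following shrinks as $A(t)$ shrinks.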

4  Experimental  Evaluation 

4.1  Network  Data  and  Settings 

In our experiments, we employed two datasets of large real networks used in [Kimura et al., 2009], which exhibit many of the key features of social networks.

The first one is a trackback network of Japanese blogs. The network data was collected by tracing the trackbacks from one blog in the site “goo (http://blog.goo.ne.jp/)” in May, 2005. We refer to the network data as the blog network. The blog network was a strongly-connected bidirectional network, where a link created by a trackback was regarded as a bidirectional link since blog authors establish mutual communications by putting trackbacks on each other’s blogs. The blog network had 12,047 nodes and 79,920 directed links.

The second one is a network of people that was derived from the “list of people” within Japanese Wikipedia. We refer to the network data as the Wikipedia network. The Wikipedia network was also a strongly-connected bidirectional network, and had 9,481 nodes and 245,044 directed links.

For the SIS model, we assigned a uniform probability $p$ to the propagation probability $p_{u,v}$ for any link $(u, v) \in E$, that is, $p_{u,v} = p$. Following [Kempe et al., 2003; Leskovec et al., 2007b], we set the value of $p$ relatively small. In particular, we set the value of $p$ to a value smaller than $1/d$, where $d$ is the mean out-degree of a network. Since the values of $d$ were about 6.63 and 25.85 for the blog and the Wikipedia networks, respectively, the corresponding values of $1/d$ were about 0.15 and 0.03. We decided to set $p = 0.1$ for the blog network and $p = 0.01$ for the Wikipedia network.


All  our  experimentation  was  undertaken  on  a  single  PC 
with  an  Intel  Core  2  Duo  E6850  3GHz  processor,  with  3GB 
of  memory,  running  under  Linux. 

4.2  Estimation  Accuracy  Comparison 

We first compared the accuracy of the influence function $\sigma$ estimated by the proposed method (BP with pruning) with that estimated by the naive method. Both methods require $M$ to be specified in advance as a parameter. As shown in Section 3.2, the number of coin flips differs between these two methods, and it is much larger in the naive method. However, this does not mean that more randomness is introduced in the naive method and that the naive method therefore converges faster. In fact, for each single initially activated node $v$ from which to propagate the information, the number of independent coin-flips is effectively the same for both methods. Thus, by using the same value of $M$, both would estimate $\sigma(v, t)$ with the same accuracy in principle.

Table 1: Results for the naive method on the blog network.

Rank   Node ID   Influence   Node ID   Influence
       (run 1)   (run 1)     (run 2)   (run 2)
  1     2210      984.38      2210      985.74
  2     2248      979.59      2248      980.72
  3     3906      956.82      3906      956.57
  4     3907      953.14      3907      953.89
  5      146      931.03       146      931.62
  6      155      929.68       155      930.21
  7     3233      913.50      3233      911.89
  8     3228      912.27      3228      910.52
  9      140      910.04       140      910.37
 10     2247      909.59      2247      910.00

Table 2: Results for the proposed method on the blog network.

Rank   Node ID   Influence   Node ID   Influence
       (run 1)   (run 1)     (run 2)   (run 2)
  1     2210      984.74      2210      984.87
  2     2248      980.41      2248      979.46
  3     3906      956.97      3906      955.84
  4     3907      953.04      3907      952.71
  5      146      929.96       146      929.30
  6      155      928.77       155      928.49
  7     3233      912.61      3233      911.01
  8     3228      912.18      3228      910.49
  9      140      909.22       140      910.31
 10     2247      909.12      2247      909.59

We have experimentally confirmed that using $M = 100{,}000$ gives effectively the same values of $\sigma(v, t)$ for $t = 1, \ldots, 20$. The following accuracy comparison is based on $M = 100{,}000$. Tables 1 and 2 show the ranking of the influential initially activated nodes $v$ evaluated at time-step $T = 20$ for the blog network. The values of the influence function $\sigma(v, 20)$ are sorted in decreasing order and the top 10 nodes are listed. We repeated the experiment several times and list two of the runs. Note that the naive method takes on the order of a week to return the result, so we could not set $T$ to a larger value. We note that the ranking is exactly the same for both methods. Tables 3 and 4 show the results for the Wikipedia network. Some adjacent nodes with nearly equal influence are interchanged between runs (for example, the 4th and 5th ranks for the naive method, and the 5th and 6th ranks for the proposed method), but the rankings are otherwise the same. From these results we confirm that the proposed method gives the same results as the naive method with the same value of $M$ when $M$ is large enough.

Table 3: Results for the naive method on the Wikipedia network.

Rank   Node ID   Influence   Node ID   Influence
       (run 1)   (run 1)     (run 2)   (run 2)
  1     4019      134.73      4019      133.83
  2     3729      133.24      3729      132.42
  3     7919      132.66      7919      131.98
  4     4380      132.23      1720      131.68
  5     1720      132.20      4380      131.34
  6     4465      132.10      4465      131.07
  7     1712      131.65      1712      130.69
  8     3670      130.32      1073      129.48
  9     1073      129.66      3670      129.46
 10     1191      128.61      1191      128.38

Table 4: Results for the proposed method on the Wikipedia network.

Rank   Node ID   Influence   Node ID   Influence
       (run 1)   (run 1)     (run 2)   (run 2)
  1     4019      134.25      4019      133.67
  2     3729      132.91      7919      132.17
  3     7919      132.50      3729      132.02
  4     4380      132.03      4380      131.84
  5     4465      131.95      1720      131.63
  6     1720      131.59      4465      131.12
  7     1712      131.33      1712      130.90
  8     3670      130.27      3670      129.78
  9     1073      129.22      1073      129.12
 10     1191      128.71      1191      128.40

4.3  Processing  Time  Comparison 

Next, we compared the processing time of the proposed method (BP with pruning) with that of the BP method without pruning and the naive method. Here, we used $M = 1{,}000$ in order to keep the computational time of the naive method at a reasonable level so that it can be run for larger $T$. Figures 1 and 2 show the total processing time to estimate $\{\sigma(v, t); v \in V, t = 0, 1, \ldots, T\}$ as a function of time span $T$ for the blog and the Wikipedia networks, respectively. In these figures, the circles, squares and triangles indicate the results for the proposed method (BP with pruning), the BP method without pruning, and the naive method, respectively. Note that for the blog network, the processing time for time span $T = 100$ is about 7 minutes, 2 hours and 37 hours for the proposed method, the BP method without pruning and the naive method, respectively. Namely, the proposed method is about 20 and 310 times faster than the BP method without pruning and the naive method, respectively, for $T = 100$ on the blog network. Note also that for the Wikipedia network, the processing time for time span $T = 100$ is about 3 minutes, 6 minutes and 8 hours for the proposed method, the BP method without pruning and the naive method, respectively. Namely, the proposed method is about 2 and 150 times faster than the BP method without pruning and the naive method, respectively, for $T = 100$ on the Wikipedia network.

Figure 1: Results for the blog network (processing time versus time span $T$).

Figure 2: Results for the Wikipedia network (processing time versus time span $T$).

In general, the proposed method performs the best and the BP method without
pruning follows, with the exception that the proposed method can become slightly
slower than the BP method without pruning when T is small because of the overhead
introduced by pruning. The two BP methods (with and without pruning) are much
faster than the naive method. The performance difference between the proposed
method and each of the other two methods increases as the time-step (or time span)
increases. Moreover, the same performance difference is larger for the blog network
than for the Wikipedia network. The following simple analysis explains this.
Consider the extreme case where $S(u,t) = S(v,t)$ for $\forall u, v \in A(t)$ and
$d(w) = d$ for $\forall w \in S(v,t)$ ($v \in A(t)$) at some time-step $t$. We denote
$|A(t)| = a$ and $|S(v,t)| = s$. Then, we have $N_a = as$, $N_b = sd$, $N_{b'} = asd$
and $N_c = asd'$ on average for time-step $t + 1$ (see equations (1), (2) and (3)).
Recall that $d'$ is the expected number of the occupied links, which is calculated
as $pd$, where $p$ is the common diffusion probability for all links. Further assume
that the pruning was ideal such that $N_a = s$ and $N_c = sd'$, which respectively
denote the number of nodes identified in step 2-2a and the average number of links
followed in step 2-2c'' for the BP with pruning method. Then, if $ad' > d$, i.e.,
$ad'/d = ap > 1$ holds, the improvement ratios of the BP with pruning method over
the naive method and the original BP method are respectively $asd/sd = a$ and
$asd'/sd = ap$. From our experimental results, we can estimate $a$ to be 310 for the
blog network and 150 for the Wikipedia network. Then we obtain $ap$ to be 31 and
1.5 respectively, which roughly match the actual ratios, 20 and 2.
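
The arithmetic behind this estimate can be reproduced in a few lines. The sketch below is illustrative only: the function name is hypothetical, and the values p = 0.1 (blog) and p = 0.01 (Wikipedia) are inferred here from the quoted products ap = 31 and ap = 1.5, not stated in this passage.

```python
def improvement_ratios(a, p):
    """Idealized speed-up of BP with pruning in the extreme case above.

    a: number of source nodes |A(t)| that share the same active set
    p: common diffusion probability, so that d' = p * d
    Returns (ratio over the naive method, ratio over BP without pruning).
    """
    over_naive = a          # (a*s*d) / (s*d)
    over_plain_bp = a * p   # (a*s*d') / (s*d) with d' = p*d
    return over_naive, over_plain_bp


# p values below are assumptions inferred from the quoted ap values.
for name, a, p in [("blog", 310, 0.1), ("Wikipedia", 150, 0.01)]:
    naive_ratio, bp_ratio = improvement_ratios(a, p)
    print(f"{name}: {naive_ratio}x over naive, {bp_ratio:.1f}x over BP without pruning")
```

The measured ratios quoted above (20 and 2) are of the same order as these idealized values, which is all the extreme-case analysis is meant to show.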

5  Discussion 

Here, we compare the proposed method with the method of [Kimura et al., 2007],
which efficiently estimates the influence function, also in the framework of bond
percolation, for the IC and the LT models. That method is not applicable to the SIS
model. Its key idea is to decompose the graph generated by the bond percolation
into a set of strongly connected components (SCC) and efficiently calculate the
node reachability. However, the layered graph in the proposed method is a directed
acyclic graph, so the SCC decomposition would not work effectively. The pruning
technique in the proposed method is a new technique to improve the computational
efficiency for the SIS model, just as the SCC decomposition is for the IC and the LT
models.

In this paper we did not directly address the influence maximization problem, but
only proposed a new method to efficiently estimate the influence function. We can
think of two maximization problems, that is, to find a specified number of initial
active nodes that maximize 1) the expected number of nodes that have been activated
by the end of time-step T and 2) the expected number of active nodes at the end of
time-step T. The proposed method can easily be extended to efficiently estimate the
marginal gain of the objective function of each of these optimization problems when
the problems are to be solved by greedy algorithms.

6  Conclusion 

Finding influential nodes is one of the most central problems in the field of social
network analysis. There are several models that simulate how various things, e.g.,
news, rumors, diseases, innovation, and ideas, diffuse across the network. One such
realistic model is the susceptible/infected/susceptible (SIS) model, an information
diffusion model where nodes are allowed to be activated multiple times. The
computational complexity drastically increases because of this multiple activation
property, e.g., compared with the susceptible/infected/recovered (SIR) model where
once activated nodes can never be deactivated/reactivated. We addressed the
problem of efficiently estimating the influence function under the SIS model, i.e.,
estimating the expected number of activated nodes at time-step $t$ for
$t = 1, \dots, T$ starting from an initially activated node $v$ (for all $v \in V$)
at time-step $t = 0$. We solved this problem by constructing a layered graph from
the original social network by adding each layer on top of the existing layers as
the time proceeds, and applying the bond percolation with a pruning strategy. We
showed by a theoretical analysis that the computational complexity of the proposed
method is much smaller than that of the conventional naive probabilistic simulation
method. We further confirmed this by applying the proposed method to two real world
networks taken from blog and Wikipedia data. Considerable reduction of computation
time was achieved without degrading the accuracy.

References 

[Adar and Adamic, 2005] E. Adar and L. A. Adamic. Tracking information epidemics
in blogspace. In WI’05, pages 207-214, 2005.

[Agarwal and Liu, 2008] N. Agarwal and H. Liu. Blogosphere: Research issues, tools,
and applications. SIGKDD Explorations, 10(1):18-31, 2008.

[Domingos and Richardson, 2001] P. Domingos and M. Richardson. Mining the network
value of customers. In KDD’01, pages 57-66, 2001.

[Gruhl et al., 2004] D. Gruhl, R. Guha, D. Liben-Nowell, and A. Tomkins. Information
diffusion through blogspace. In WWW’04, pages 107-117, 2004.

[Kempe et al., 2003] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of
influence through a social network. In KDD’03, pages 137-146, 2003.

[Kimura et al., 2007] M. Kimura, K. Saito, and R. Nakano. Extracting influential
nodes for information diffusion on a social network. In AAAI’07, pages 1371-1376,
2007.

[Kimura et al., 2009] M. Kimura, K. Saito, and H. Motoda. Blocking links to minimize
contamination spread in a social network. ACM Transactions on Knowledge Discovery
from Data, 3(2):9:1-9:23, 2009.

[Leskovec et al., 2007a] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos,
J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In
KDD’07, pages 420-429, 2007.

[Leskovec et al., 2007b] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and
M. Hurst. Patterns of cascading behavior in large blog graphs. In SDM’07, pages
551-556, 2007.

[Newman, 2003] M. E. J. Newman. The structure and function of complex networks.
SIAM Review, 45(2):167-256, 2003.

[Richardson and Domingos, 2002] M. Richardson and P. Domingos. Mining
knowledge-sharing sites for viral marketing. In KDD’02, pages 61-70, 2002.


Discovering Influential Nodes for SIS models in Social Networks


Kazumi Saito¹, Masahiro Kimura², and Hiroshi Motoda³

¹ School of Administration and Informatics, University of Shizuoka
52-1 Yada, Suruga-ku, Shizuoka 422-8526, Japan
k-saito@u-shizuoka-ken.ac.jp

² Department of Electronics and Informatics, Ryukoku University
Otsu, Shiga 520-2194, Japan
kimura@rins.ryukoku.ac.jp

³ Institute of Scientific and Industrial Research, Osaka University
8-1 Mihogaoka, Ibaraki, Osaka 567-0047, Japan
motoda@ar.sanken.osaka-u.ac.jp


Abstract. We address the problem of efficiently discovering the influential nodes
in a social network under the susceptible/infected/susceptible (SIS) model, a
diffusion model where nodes are allowed to be activated multiple times. The
computational complexity drastically increases because of this multiple activation
property. We solve this problem by constructing a layered graph from the original
social network with each layer added on top as the time proceeds, and applying
the bond percolation with pruning and burnout strategies. We experimentally
demonstrate that the proposed method gives much better solutions than the
conventional methods that are solely based on the notion of centrality for social
network analysis using two large-scale real-world networks (a blog network and a
Wikipedia network). We further show that the computational complexity of the
proposed method is much smaller than the conventional naive probabilistic
simulation method by a theoretical analysis and confirm this by experimentation.
The properties of the influential nodes discovered are substantially different from
those identified by the centrality-based heuristic methods.


1  Introduction 

Social networks mediate the spread of various information including topics, ideas and
even (computer) viruses. The proliferation of emails, blogs and social networking
services (SNS) in the World Wide Web accelerates the creation of large social networks.
Therefore, substantial attention has recently been directed to investigating information
diffusion phenomena in social networks [1-3].

Overall, finding influential nodes is one of the most central problems in social
network analysis. Thus, developing methods to do this on the basis of information
diffusion is an important research issue. Widely-used fundamental probabilistic models
of information diffusion are the independent cascade (IC) model and the linear
threshold (LT) model [4, 5]. Researchers investigated the problem of finding a limited
number of influential nodes that are effective for the spread of information under the
above models [4, 6]. This combinatorial optimization problem is called the influence
maximization problem. Kempe et al. [4] experimentally showed on large collaboration
networks that the greedy algorithm can give a good approximate solution to this
problem, and mathematically proved a performance guarantee of the greedy solution
(i.e., the solution obtained by the greedy algorithm). Recently, methods based on bond
percolation [6] and submodularity [7] were proposed for efficiently estimating the
greedy solution. The influence maximization problem has applications in sociology and
“viral marketing” [3], and was also investigated in a different setting (a descriptive
probabilistic model of interaction) [8, 9]. The problem has recently been extended to
influence control problems such as a contamination minimization problem [10].

The IC model can be identified with the so-called susceptible/infected/recovered
(SIR) model for the spread of a disease [11, 5]. In the SIR model, only infected
individuals can infect susceptible individuals, while recovered individuals can neither
infect nor be infected. This implies that an individual is never infected with the
disease multiple times. This property holds true for the LT model as well. However,
there exist phenomena for which the property does not hold. For example, consider the
following propagation phenomenon of a topic in the blogosphere: A blogger who has not
yet posted a message about the topic becomes interested in the topic by reading the
blog of a friend, and posts a message about it (i.e., becoming infected). Next, the same
blogger reads a new message about the topic posted by some other friend, and may post a
message (i.e., becoming infected) again. Most simply, this phenomenon can be modeled by
a susceptible/infected/susceptible (SIS) model from epidemiology. As in this example,
there are many information diffusion phenomena for which the SIS model is more
appropriate, including the growth of hyper-link posts among bloggers [2], the spread of
computer viruses without permanent virus-checking programs, and epidemic diseases such
as tuberculosis and gonorrhea [11].

We focus on an information diffusion process in a social network $G = (V, E)$ over
a given time span $T$ on the basis of an SIS model. Here, the SIS model is a stochastic
process model, and the influence of a set of nodes $H$ at time-step $t$, $\sigma(H, t)$,
is defined as the expected number of infected nodes at time-step $t$ when all the nodes
in $H$ are initially infected at time-step $t = 0$. We refer to $\sigma$ as the influence
function for the SIS model. Developing an effective method for estimating
$\sigma(\{v\}, t)$ $(v \in V,\ t = 1, \dots, T)$ is vital for various applications.
Clearly, in order to extract influential nodes, we must estimate the value of
$\sigma(\{v\}, t)$ for every node $v$ and time-step $t$. Thus, we proposed a novel method
based on the bond percolation with an effective pruning strategy to efficiently estimate
$\{\sigma(\{v\}, t);\ v \in V,\ t = 1, \dots, T\}$ for the SIS model in our previous
work [12].

In this paper, we consider solving the influence maximization problems on a network
$G = (V, E)$ under the SIS model. Here, unlike the cases of the IC and the LT models,
we define two influence maximization problems, the final-time maximization problem and
the accumulated-time maximization problem, for the SIS model. We introduce the greedy
algorithm for solving the problems according to the work of Kempe et al. [4] for the IC
and the LT models. Now, let us consider the problem of influence maximization at the
final time step $T$ (i.e., the final-time maximization problem) as an example. We then
note that for solving this problem by the greedy algorithm, we need a method for not
only evaluating $\{\sigma(\{v\}, T);\ v \in V\}$, but also evaluating the marginal
influence gains $\{\sigma(H \cup \{v\}, T) - \sigma(H, T);\ v \in V \setminus H\}$ for
any non-empty subset $H$ of $V$. Needless to say, we can naively estimate the marginal
influence gains for any non-empty subset $H$ of $V$ by simulating the SIS model².
However, this naive simulation method is overly inefficient and not practical at all.
In this paper, by incorporating new techniques (the pruning and the burnout methods)
into the bond percolation method, we propose a method to efficiently estimate the
marginal influence gains for any non-empty subset $H$ of $V$, and apply it to
approximately solve the two influence maximization problems for the SIS model by the
greedy algorithm. We show that the proposed method is expected to achieve a large
reduction in computational cost by theoretically comparing its computational complexity
with that of other more naive methods. Further, using two large real networks, we
experimentally demonstrate that the proposed method is much more efficient than the
naive greedy method based on the bond percolation method. We also show that the nodes
discovered by the proposed method are substantially different from those found by the
conventional methods based on various centrality measures, and can yield a considerably
larger influence.

2  Information  Diffusion  Model 

Let $G = (V, E)$ be a directed network, where $V$ and $E$ $(\subset V \times V)$ stand
for the sets of all the nodes and (directed) links, respectively. For any $v \in V$, let
$\Gamma(v; G)$ denote the set of the child nodes (directed neighbors) of $v$, that is,

$$\Gamma(v; G) = \{w \in V;\ (v, w) \in E\}.$$
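
As a concrete reading of this definition, the child-node sets can be built directly from an edge list. This is only a sketch; the function name and the toy edge list are made up for illustration.

```python
from collections import defaultdict


def child_nodes(edges):
    """Map each node v to Gamma(v; G), the set of nodes w with (v, w) in E."""
    gamma = defaultdict(set)
    for v, w in edges:
        gamma[v].add(w)
    return gamma


# Toy directed network (hypothetical data).
E = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
gamma = child_nodes(E)
print(sorted(gamma["a"]))
```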


2.1  SIS  Model 

An  SIS  model  for  the  spread  of  a  disease  is  based  on  the  cycle  of  disease  in  a  host.  A  per¬ 
son  is  first  susceptible  to  the  disease,  and  becomes  infected  with  some  probability  when 
the  person  encounters  an  infected  person.  The  infected  person  becomes  susceptible  to 
the  disease  soon  without  moving  to  the  immune  state.  We  consider  a  discrete-time  SIS 
model  for  information  diffusion  on  a  network.  In  this  context,  infected  nodes  mean  that 
they  have  just  adopted  the  information,  and  we  call  these  infected  nodes  active  nodes. 

We define the SIS model for information diffusion on $G$. In the model, the diffusion
process unfolds in discrete time-steps $t \ge 0$, and it is assumed that the state of a
node is either active or inactive. For every link $(u, v) \in E$, we specify a real value
$p_{u,v}$ with $0 < p_{u,v} < 1$ in advance. Here, $p_{u,v}$ is referred to as the
propagation probability through link $(u, v)$. Given an initial set of active nodes $X$
and a time span $T$, the diffusion process proceeds in the following way. Suppose that
node $u$ becomes active at time-step $t$ $(< T)$. Then, node $u$ attempts to activate
every $v \in \Gamma(u; G)$, and succeeds with probability $p_{u,v}$. If node $u$
succeeds, then node $v$ will become active at time-step $t + 1$. If multiple active nodes
attempt to activate node $v$ in time-step $t$, then their activation attempts are
sequenced in an arbitrary order. On the other hand, node $v$ will become or remain
inactive at time-step $t + 1$ unless it is activated from an active node in time-step
$t$. The process terminates if the current time-step reaches the time limit $T$.
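
The process described above can be sketched as a short simulation of one diffusion sample. The function name, argument layout, and the toy two-node network are assumptions made for illustration; the probability 1 in the usage lines lies outside the open interval required above and is used only so that the trace is deterministic.

```python
import random


def sis_sample(children, prob, initial, T, rng=random):
    """Run one discrete-time SIS diffusion sample.

    children: dict node -> iterable of child nodes, i.e. Gamma(u; G)
    prob:     dict (u, v) -> propagation probability p_{u,v}
    initial:  iterable of nodes active at time-step 0
    Returns a list S where S[t] is the set of active nodes at time-step t.
    """
    active = set(initial)
    history = [set(active)]
    for _ in range(T):
        nxt = set()
        for u in active:
            for v in children.get(u, ()):
                # Each active node makes one independent attempt per child.
                if rng.random() < prob[(u, v)]:
                    nxt.add(v)
        # Nodes that were not activated in this step become or stay inactive.
        active = nxt
        history.append(set(active))
    return history


# Two-node cycle with probability 1 (degenerate, for a deterministic trace):
# activity alternates between "a" and "b".
children = {"a": ["b"], "b": ["a"]}
prob = {("a", "b"): 1.0, ("b", "a"): 1.0}
print(sis_sample(children, prob, ["a"], T=3))
```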

² Note that the method we proposed in [12] does not perform simulation.

2.2 Influence Function

For the SIS model on $G$, we consider a diffusion sample from an initially activated
node set $H \subset V$ over time span $T$. Let $S(H, t)$ denote the set of active nodes
at time-step $t$. Note that $S(H, t)$ is a random subset of $V$ and $S(H, 0) = H$. Let
$\sigma(H, t)$ denote the expected value of $|S(H, t)|$, where $|X|$ stands for the
number of elements in a set $X$. We call $\sigma(H, t)$ the influence of node set $H$ at
time-step $t$. Note that $\sigma$ is a function defined on
$2^V \times \{0, 1, \dots, T\}$. We call the function $\sigma$ the influence function for
the SIS model over time span $T$ on network $G$.

It is important to estimate the influence function $\sigma$ efficiently. In theory we
can simply estimate $\sigma$ by simulations based on the SIS model in the following way.
First, a sufficiently large positive integer $M$ is specified. For each $H \subset V$,
the diffusion process of the SIS model is simulated from the initially activated node
set $H$, and the number of active nodes at time-step $t$, $|S(H, t)|$, is calculated for
every $t \in \{0, 1, \dots, T\}$. Then, $\sigma(H, t)$ is estimated as the empirical mean
of the $|S(H, t)|$'s that are obtained from $M$ such simulations. However, this is
extremely inefficient, and cannot be practical.
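
For a single node set $H$, the simulation-based estimate described above amounts to averaging $|S(H, t)|$ over $M$ runs. The following is a minimal sketch under that reading; the function name, the `seed` parameter, and the toy inputs are hypothetical.

```python
import random


def naive_influence(children, prob, H, T, M, seed=0):
    """Estimate sigma(H, t) for t = 0..T as the mean of |S(H, t)| over M runs.

    children: dict node -> iterable of child nodes
    prob:     dict (u, v) -> propagation probability p_{u,v}
    """
    rng = random.Random(seed)
    totals = [0] * (T + 1)
    for _ in range(M):
        active = set(H)
        totals[0] += len(active)
        for t in range(1, T + 1):
            # One coin flip per (active node, child) pair; a child becomes
            # active if at least one attempt on it succeeds.
            active = {v for u in sorted(active) for v in children.get(u, ())
                      if rng.random() < prob[(u, v)]}
            totals[t] += len(active)
    return [total / M for total in totals]


# Single link a -> b with p = 0.5: the estimate for t = 1 should be near 0.5.
print(naive_influence({"a": ["b"]}, {("a", "b"): 0.5}, ["a"], T=2, M=10000))
```

Repeating this for every candidate set is what makes the naive approach expensive: each set needs its own $M$ runs, and no coin flips are shared between them.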

3  Influence  Maximization  Problem 

We mathematically define the influence maximization problems on a network $G = (V, E)$
under the SIS model. Let $K$ be a positive integer with $K < |V|$. First, we define the
final-time maximization problem: Find a set $H^*_K$ of $K$ nodes to target for initial
activation such that $\sigma(H^*_K, T) \ge \sigma(H, T)$ for any set $H$ of $K$ nodes,
that is, find

$$H^*_K = \arg\max_{\{H \subset V;\ |H| = K\}} \sigma(H, T). \qquad (1)$$

Second, we define the accumulated-time maximization problem: Find a set $H^*_K$ of $K$
nodes to target for initial activation such that
$\sigma(H^*_K, 1) + \cdots + \sigma(H^*_K, T) \ge \sigma(H, 1) + \cdots + \sigma(H, T)$
for any set $H$ of $K$ nodes, that is, find

$$H^*_K = \arg\max_{\{H \subset V;\ |H| = K\}} \sum_{t=1}^{T} \sigma(H, t). \qquad (2)$$

The first problem cares only about how many nodes are influenced at the time of
interest. For example, in an election campaign it is only those people who are
convinced to vote for the candidate at the time of voting that really matter, and not
those who were convinced during the campaign but changed their mind at the very end.
Maximizing the number of people who actually vote falls in this category. The second
problem cares about how many nodes have been influenced throughout the period of
interest. For example, maximizing the amount of product purchase during a sales
campaign falls in this category.
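
The two objectives can pick different node sets. The sketch below makes this concrete with an exhaustive search over a tiny candidate space; the function names are hypothetical and the influence values in the table are made up for illustration, not taken from any experiment.

```python
from itertools import combinations


def final_time_objective(sigma_H):
    """sigma_H[t] holds sigma(H, t) for t = 0..T; objective of problem (1)."""
    return sigma_H[-1]


def accumulated_objective(sigma_H):
    """Objective of problem (2): sum of sigma(H, t) over t = 1..T."""
    return sum(sigma_H[1:])


def exhaustive_solution(nodes, K, influence, objective):
    """Brute-force arg max over all K-subsets; feasible only for tiny graphs."""
    return max(combinations(sorted(nodes), K),
               key=lambda H: objective(influence(frozenset(H))))


# Made-up influence curves for T = 2: "a" peaks early, "b" peaks late.
table = {frozenset({"a"}): [1, 3.0, 1.0], frozenset({"b"}): [1, 1.0, 2.0]}
print(exhaustive_solution({"a", "b"}, 1, table.__getitem__, final_time_objective))
print(exhaustive_solution({"a", "b"}, 1, table.__getitem__, accumulated_objective))
```

With these values the final-time objective prefers "b" (2.0 against 1.0 at t = 2), while the accumulated-time objective prefers "a" (4.0 against 3.0).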

4  Proposed  Method 

Kempe et al. [4] showed the effectiveness of the greedy algorithm for the influence
maximization problem under the IC and LT models. In this section, we introduce the
greedy algorithm for the SIS model, and describe some techniques (the bond percolation
method, the pruning method, and the burnout method) for efficiently solving the
influence maximization problem under the greedy algorithm, together with some arguments
for evaluating the computational complexity of these methods.

4.1  Greedy  Algorithm 

We approximately solve the influence maximization problem by the greedy algorithm.
Below we describe this algorithm for the final-time maximization problem:

Greedy algorithm for the final-time maximization problem:

A1. Set $H \leftarrow \emptyset$.

A2. For $k = 1$ to $K$ do the following steps:

A2-1. Choose a node $v^* \in V \setminus H$ maximizing $\sigma(H \cup \{v\}, T)$.

A2-2. Set $H \leftarrow H \cup \{v^*\}$.

A3. Output $H$.

Here we can easily modify this algorithm for the accumulated-time maximization problem
by replacing step A2-1 as follows:

Greedy algorithm for the accumulated-time maximization problem:

A1. Set $H \leftarrow \emptyset$.

A2. For $k = 1$ to $K$ do the following steps:

A2-1'. Choose a node $v^* \in V \setminus H$ maximizing
$\sum_{t=1}^{T} \sigma(H \cup \{v\}, t)$.

A2-2. Set $H \leftarrow H \cup \{v^*\}$.

A3. Output $H$.

Let $H_K$ denote the set of $K$ nodes obtained by this algorithm. We refer to $H_K$ as
the greedy solution of size $K$. Then, it is known that

$$\sigma(H_K, t) \ge \left(1 - \frac{1}{e}\right) \sigma(H^*_K, t),$$

that is, the quality guarantee of $H_K$ is assured [4]. Here, $H^*_K$ is the exact
solution defined by Equation (1) or (2).

To implement the greedy algorithm, we need a method for estimating all the marginal
influence degrees $\{\sigma(H \cup \{v\}, t);\ v \in V \setminus H\}$ of $H$ in step
A2-1 or A2-1' of the algorithm. In the subsequent subsections, we propose a method for
efficiently estimating the influence function $\sigma$ over time span $T$ for the SIS
model on network $G$.
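
The greedy loop itself is independent of how the objective is evaluated. A minimal sketch is given below, assuming a caller-supplied `objective(H)` (a hypothetical name) that returns the quantity to maximize, for example an estimate of $\sigma(H, T)$ or of $\sum_t \sigma(H, t)$. The usage lines substitute a simple set-coverage objective as a stand-in, since it is cheap and deterministic; it is not an SIS influence function.

```python
def greedy(nodes, K, objective):
    """Greedily select K nodes, adding the best single node at each round.

    objective(H) should return the value to maximize for a candidate set H.
    Ties are broken by the sorted order of the node labels.
    """
    H = set()
    for _ in range(K):
        candidates = sorted(set(nodes) - H)
        best = max(candidates, key=lambda v: objective(H | {v}))
        H.add(best)
    return H


# Stand-in objective (an assumption for illustration): number of items covered.
reach = {"a": {1, 2, 3, 7}, "b": {3, 4, 6}, "c": {5}}


def coverage(H):
    return len(set().union(*(reach[v] for v in H)))


print(sorted(greedy(reach, 2, coverage)))
```

Each round calls the objective once per remaining node, which is why an estimator that evaluates all candidates $H \cup \{v\}$ simultaneously matters so much in practice.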

4.2  Layered  Graph 

We build a layered graph $G^T = (V^T, E^T)$ from $G$ in the following way. First, for
each node $v \in V$ and each time-step $t \in \{0, 1, \dots, T\}$, we generate a copy
$v_t$ of $v$ at time-step $t$. Let $V_t$ denote the set of copies of all $v \in V$ at
time-step $t$. We define $V^T$ by $V^T = V_0 \cup V_1 \cup \cdots \cup V_T$. In
particular, we identify $V$ with $V_0$. Next, for each link $(u, v) \in E$, we generate
$T$ links $(u_{t-1}, v_t)$, $(t \in \{1, \dots, T\})$, in the set of nodes $V^T$. We set
$E_t = \{(u_{t-1}, v_t);\ (u, v) \in E\}$, and define $E^T$ by
$E^T = E_1 \cup \cdots \cup E_T$. Moreover, for any link $(u_{t-1}, v_t)$ of the layered
graph $G^T$, we define the occupation probability $q_{u_{t-1}, v_t}$ by
$q_{u_{t-1}, v_t} = p_{u,v}$.

Then, we can easily prove that the SIS model with propagation probabilities
$\{p_e;\ e \in E\}$ on $G$ over time span $T$ is equivalent to the bond percolation
process (BP) with occupation probabilities $\{q_e;\ e \in E^T\}$ on $G^T$.³ Here, the BP
process with occupation probabilities $\{q_e;\ e \in E^T\}$ on $G^T$ is the random
process in which each link $e \in E^T$ is independently declared “occupied” with
probability $q_e$. We perform the BP process on $G^T$, and generate a graph constructed
by occupied links, $\tilde{G}^T = (V^T, \tilde{E}^T)$. Then, in terms of information
diffusion by the SIS model on $G$, an occupied link $(u_{t-1}, v_t) \in E_t$ represents a
link $(u, v) \in E$ through which the information propagates at time-step $t$, and an
unoccupied link $(u_{t-1}, v_t) \in E_t$ represents a link $(u, v) \in E$ through which
the information does not propagate at time-step $t$. For any $v \in V \setminus H$, let
$F(H \cup \{v\}; \tilde{G}^T)$ be the set of all nodes that can be reached from
$H \cup \{v\} \subset V_0$ through a path on the graph $\tilde{G}^T$. When we consider a
diffusion sample from the initial active node set $H \cup \{v\}$ for the SIS model on
$G$, $F(H \cup \{v\}; \tilde{G}^T) \cap V_t$ represents the set of active nodes at
time-step $t$, $S(H \cup \{v\}, t)$.
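
The construction can be sketched in three small steps: generate the layered links, percolate them, and read off the active sets by layer-wise reachability. All function names below are hypothetical, a layered node $v_t$ is represented as the pair `(v, t)`, and the probability 1 in the usage lines is a degenerate value chosen only to make the result deterministic.

```python
import random


def layered_links(edges, T):
    """Links ((u, t-1), (v, t)) of the layered graph for (u, v) in E, t = 1..T."""
    return [((u, t - 1), (v, t)) for t in range(1, T + 1) for (u, v) in edges]


def percolate(links, prob, rng):
    """Bond percolation: keep each layered link independently with q = p_{u,v}."""
    return [(src, dst) for (src, dst) in links
            if rng.random() < prob[(src[0], dst[0])]]


def active_sets(occupied, sources, T):
    """S[t] = nodes whose copy in layer t is reachable from the layer-0 sources."""
    out = {}
    for src, dst in occupied:
        out.setdefault(src, []).append(dst)
    S = [set(sources)]
    for t in range(1, T + 1):
        S.append({dst[0] for u in S[-1] for dst in out.get((u, t - 1), ())})
    return S


edges = [("a", "b"), ("b", "a")]
prob = {e: 1.0 for e in edges}   # degenerate value, for a deterministic demo
occupied = percolate(layered_links(edges, T=2), prob, random.Random(0))
print(active_sets(occupied, {"a"}, T=2))
```

Because every link goes from layer $t-1$ to layer $t$, reachability can be computed one layer at a time without any general graph search.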

4.3  Bond  Percolation  Method 

Using the equivalent BP process, we present a method for efficiently estimating the
influence function $\sigma$. We refer to this method as the BP method. Unlike the naive
method, the BP method simultaneously estimates $\sigma(H \cup \{v\}, t)$ for all
$v \in V \setminus H$. Moreover, the BP method does not fully perform the BP process,
but performs it partially. Note first that all the paths from nodes $H \cup \{v\}$
$(v \in V \setminus H)$ on the graph $\tilde{G}^T$ of occupied links represent a
diffusion sample from the initial active nodes $H \cup \{v\}$ for the SIS model on $G$.
Let $L'$ be the set of the links in $G^T$ that are not in the diffusion sample. For
calculating $|S(H \cup \{v\}, t)|$, it is unnecessary to determine whether the links in
$L'$ are occupied or not. Therefore, the BP method performs the BP process for only an
appropriate set of links in $G^T$. The BP method estimates $\sigma$ by the following
algorithm:

BP method:

B1. Set $\sigma(H \cup \{v\}, t) \leftarrow 0$ for each $v \in V \setminus H$ and
$t \in \{1, \dots, T\}$.

B2. Repeat the following procedure $M$ times:

B2-1. Initialize $S(H \cup \{v\}, 0) = H \cup \{v\}$ for each $v \in V \setminus H$, and
set $A(0) \leftarrow V \setminus H$, $A(1) \leftarrow \emptyset$, $\dots$,
$A(T) \leftarrow \emptyset$.

B2-2. For $t = 1$ to $T$ do the following steps:

B2-2a. Compute $B(t-1) = \bigcup_{v \in A(t-1)} S(H \cup \{v\}, t-1)$.

B2-2b. Perform the BP process for the links from $B(t-1)$ in $G^T$, and generate the
graph $\tilde{G}_t$ constructed by the occupied links.

B2-2c. For each $v \in A(t-1)$, compute
$S(H \cup \{v\}, t) = \bigcup_{w \in S(H \cup \{v\}, t-1)} \Gamma(w; \tilde{G}_t)$, and
set $\sigma(H \cup \{v\}, t) \leftarrow \sigma(H \cup \{v\}, t) + |S(H \cup \{v\}, t)|$
and $A(t) \leftarrow A(t) \cup \{v\}$ if $S(H \cup \{v\}, t) \ne \emptyset$.

B3. For each $v \in V \setminus H$ and $t \in \{1, \dots, T\}$, set
$\sigma(H \cup \{v\}, t) \leftarrow \sigma(H \cup \{v\}, t)/M$, and output
$\sigma(H \cup \{v\}, t)$.


³ The SIS model over time span $T$ on $G$ can be exactly mapped onto the IC model on
$G^T$ [4]. Thus, the result follows from the equivalence of the BP process and the IC
model [11, 4, 6].

Note that $A(t)$ finally becomes the set of information source nodes that have at least
one active node at time-step $t$, that is,
$A(t) = \{v \in V \setminus H;\ S(H \cup \{v\}, t) \ne \emptyset\}$. Note also that
$B(t-1)$ is the set of nodes that are activated at time-step $t-1$ by some source nodes,
that is, $B(t-1) = \bigcup_{v \in V} S(H \cup \{v\}, t-1)$.
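
A compact sketch of the BP method follows, with comments keyed to the sub-step suffixes (2-1, 2-2a, 2-2b, 2-2c). It is one possible reading of the steps, not the authors' implementation; the function name, the `seed` parameter, and the dict-based graph representation are assumptions.

```python
import random


def bp_method(children, prob, H, T, M, seed=0):
    """Estimate sigma(H ∪ {v}, t), t = 1..T, for every v not in H.

    The coin flips of each time-step are shared by all source nodes v.
    Returns a dict v -> [sigma(H ∪ {v}, 1), ..., sigma(H ∪ {v}, T)].
    """
    rng = random.Random(seed)
    nodes = set(children) | {w for ws in children.values() for w in ws} | set(H)
    sources = sorted(nodes - set(H))
    sigma = {v: [0.0] * (T + 1) for v in sources}
    for _ in range(M):
        S = {v: set(H) | {v} for v in sources}          # step 2-1
        alive = list(sources)                           # A(0)
        for t in range(1, T + 1):
            B = set().union(*(S[v] for v in alive))     # step 2-2a
            # Step 2-2b: one coin flip per link leaving B(t-1).
            occupied = {w: [x for x in children.get(w, ())
                            if rng.random() < prob[(w, x)]]
                        for w in sorted(B)}
            next_alive = []
            for v in alive:                             # step 2-2c
                S[v] = {x for w in S[v] for x in occupied[w]}
                sigma[v][t] += len(S[v])
                if S[v]:
                    next_alive.append(v)
            alive = next_alive                          # A(t)
    return {v: [total / M for total in sigma[v][1:]] for v in sources}


children = {"a": ["b"], "b": ["a"], "c": ["a"]}
prob = {("a", "b"): 0.5, ("b", "a"): 0.5, ("c", "a"): 0.5}
print(bp_method(children, prob, H=["c"], T=2, M=2000))
```

Sources whose active set becomes empty drop out of `alive`, so later time-steps only touch sources that can still contribute.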

Now we estimate the computational complexity of the BP method in terms of the number
of the nodes, $N_a$, that are identified in step B2-2a, the number of the coin-flips,
$N_b$, for the BP process in step B2-2b, and the number of the links, $N_c$, that are
followed in step B2-2c. Let $d(v)$ be the number of out-links from node $v$ (i.e.,
out-degree of $v$) and $d'(v)$ the average number of occupied out-links from node $v$
after the BP process. Here we can estimate $d'(v)$ by
$\sum_{w \in \Gamma(v; G)} p_{v,w}$. Then, for each time-step $t \in \{1, \dots, T\}$,
we have

$$N_a = \sum_{v \in A(t-1)} |S(H \cup \{v\}, t-1)|,$$

$$N_b = \sum_{w \in B(t-1)} d(w), \qquad (3)$$

$$N_c = \sum_{v \in A(t-1)} \sum_{w \in S(H \cup \{v\}, t-1)} d'(w)$$

on average.

In order to compare the computational complexity of the BP method to that of the
naive method, we consider mapping the naive method onto the BP framework, that is,
separating the coin-flip process and the link-following process. We can easily verify
that the following algorithm in the BP framework is equivalent to the naive method:

A method that is equivalent to the naive method:

B1. Set $\sigma(H \cup \{v\}, t) \leftarrow 0$ for each $v \in V \setminus H$ and
$t \in \{1, \dots, T\}$.

B2. Repeat the following procedure $M$ times:

B2-1. Initialize $S(H \cup \{v\}, 0) = H \cup \{v\}$ for each $v \in V \setminus H$, and
set $A(0) \leftarrow V \setminus H$, $A(1) \leftarrow \emptyset$, $\dots$,
$A(T) \leftarrow \emptyset$.

B2-2. For $t = 1$ to $T$ do the following steps:

B2-2b'. For each $v \in A(t-1)$, perform the BP process for the links from
$S(H \cup \{v\}, t-1)$ in $G^T$, and generate the graph $\tilde{G}_t(v)$ constructed by
the occupied links.

B2-2c'. For each $v \in A(t-1)$, compute
$S(H \cup \{v\}, t) = \bigcup_{w \in S(H \cup \{v\}, t-1)} \Gamma(w; \tilde{G}_t(v))$,
and set
$\sigma(H \cup \{v\}, t) \leftarrow \sigma(H \cup \{v\}, t) + |S(H \cup \{v\}, t)|$ and
$A(t) \leftarrow A(t) \cup \{v\}$ if $S(H \cup \{v\}, t) \ne \emptyset$.

B3. For each $v \in V \setminus H$ and $t \in \{1, \dots, T\}$, set
$\sigma(H \cup \{v\}, t) \leftarrow \sigma(H \cup \{v\}, t)/M$, and output
$\sigma(H \cup \{v\}, t)$.

Then, for each $t \in \{1, \dots, T\}$, the number of coin-flips, $N_{b'}$, in step
B2-2b' is

$$\sum_{v \in A(t-1)} \sum_{w \in S(H \cup \{v\}, t-1)} d(w), \qquad (4)$$

and the number of the links, $N_{c'}$, followed in step B2-2c' is equal to $N_c$ in the
BP method on average. From equations (3) and (4), we can see that $N_{b'}$ is much
larger than $N_{c'} = N_c$, especially for the case where the diffusion probabilities
are small. We can also see that $N_{b'}$ is generally much larger than each of $N_a$ and
$N_b$ in the BP method for a real social network. In fact, since such a network
generally includes large clique-like subgraphs, there are many nodes $w \in V$ such that
$d(w) \gg 1$, and we can expect that
$\sum_{v \in A(t-1)} |S(H \cup \{v\}, t-1)| \gg |\bigcup_{v \in A(t-1)} S(H \cup \{v\}, t-1)|$
$(= |B(t-1)|)$. Therefore, the BP method is expected to achieve a large reduction in
computational cost.
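
The gap between the two coin-flip counts is easy to see on a small configuration in which several sources share a high-degree node. The sketch below evaluates both counts for one time-step; the function name and the numbers are made up for illustration.

```python
def coin_flip_counts(active_sets, out_degree):
    """Coin flips needed for one time-step.

    active_sets: dict source v -> S(H ∪ {v}, t-1), for v in A(t-1)
    out_degree:  dict node w -> d(w)
    Returns (N_b, N_b'): flips with shared coins (BP method) and flips when
    every source flips its own coins (naive method).
    """
    B = set().union(*active_sets.values())
    n_b = sum(out_degree[w] for w in B)
    n_b_naive = sum(out_degree[w] for S in active_sets.values() for w in S)
    return n_b, n_b_naive


# Three sources whose active sets all contain the hub "h" (hypothetical data).
active = {"u": {"h", "x"}, "v": {"h", "y"}, "w": {"h"}}
degree = {"h": 100, "x": 2, "y": 3}
print(coin_flip_counts(active, degree))
```

Here the hub's 100 out-links are flipped once in the shared scheme but three times in the naive one, giving 105 against 305 flips.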

4.4  Pruning  Method 

In order to further improve the computational efficiency of the BP method, we introduce
a pruning technique and propose a method referred to as the BP with pruning method.
The key idea of the pruning technique is to utilize the following property: Once we have
$S(H \cup \{u\}, t_0) = S(H \cup \{v\}, t_0)$ at some time-step $t_0$ in the course of
the BP process for a pair of information source nodes, $u$ and $v$, then we have
$S(H \cup \{u\}, t) = S(H \cup \{v\}, t)$ for all $t > t_0$. The BP with pruning method
estimates $\sigma$ by the following algorithm:

BP with pruning method:

S1. Set $\sigma(H \cup \{v\}, t) \leftarrow 0$ for each $v \in V \setminus H$ and $t \in \{1, \ldots, T\}$.

S2. Repeat the following procedure $M$ times:

S2-1''. Initialize $S(H \cup \{v\}; 0) = H \cup \{v\}$ for each $v \in V \setminus H$, and set $A(0) \leftarrow V \setminus H$, $A(1) \leftarrow \emptyset$, $\ldots$, $A(T) \leftarrow \emptyset$, and $C(v) \leftarrow \{v\}$ for each $v \in V \setminus H$.

S2-2. For $t = 1$ to $T$ do the following steps:

S2-2a. Compute $B(t-1) = \bigcup_{v \in A(t-1)} S(H \cup \{v\}, t-1)$.

S2-2b. Perform the BP process for the links from $B(t-1)$ in $G^T$, and generate the graph $G_t$ constructed by the occupied links.

S2-2c''. For each $v \in A(t-1)$, compute $S(H \cup \{v\}, t) = \bigcup_{w \in S(H \cup \{v\}, t-1)} \Gamma(w; G_t)$, set $A(t) \leftarrow A(t) \cup \{v\}$ if $S(H \cup \{v\}, t) \neq \emptyset$, and set $\sigma(H \cup \{u\}, t) \leftarrow \sigma(H \cup \{u\}, t) + |S(H \cup \{v\}, t)|$ for each $u \in C(v)$.

S2-2d. Check whether $S(H \cup \{u\}, t) = S(H \cup \{v\}, t)$ for $u, v \in A(t)$, and set $C(v) \leftarrow C(v) \cup C(u)$ and $A(t) \leftarrow A(t) \setminus \{u\}$ if $S(H \cup \{u\}, t) = S(H \cup \{v\}, t)$.

S3. For each $v \in V \setminus H$ and $t \in \{1, \ldots, T\}$, set $\sigma(H \cup \{v\}, t) \leftarrow \sigma(H \cup \{v\}, t)/M$, and output $\sigma(H \cup \{v\}, t)$.
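
The steps S1 to S3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `bp_with_pruning`, the adjacency dictionary `children` (which must contain every node as a key), and the uniform diffusion probability `p` are hypothetical choices. Occupied links are sampled independently for each time-step layer, and coinciding active sets are detected by hashing `frozenset` objects.

```python
import random

def bp_with_pruning(children, p, H, T, M, seed=0):
    """Estimate sigma(H + {v}, t) for every v outside H and t = 1..T.

    children: dict mapping every node to the list of its child nodes
    p: uniform diffusion probability (an assumption of this sketch)
    """
    rng = random.Random(seed)
    base = frozenset(H)
    candidates = sorted(set(children) - base)
    # S1: sigma[v][t] accumulates |S(H + {v}, t)| over the M trials.
    sigma = {v: [0.0] * (T + 1) for v in candidates}
    for _ in range(M):                                    # S2
        # S2-1'': one active set per source, plus the merge classes C(v).
        S = {v: base | {v} for v in candidates}
        A = list(candidates)
        C = {v: [v] for v in candidates}
        for t in range(1, T + 1):                         # S2-2
            # S2-2a: nodes whose outgoing links must be percolated.
            B = set().union(*(S[v] for v in A))
            # S2-2b: sample the occupied links of layer t once for all sources.
            occupied = {w: [x for x in children[w] if rng.random() < p]
                        for w in sorted(B)}
            survivors = []
            for v in A:                                   # S2-2c''
                S[v] = frozenset(x for w in S[v] for x in occupied[w])
                for u in C[v]:
                    sigma[u][t] += len(S[v])
                if S[v]:
                    survivors.append(v)
            # S2-2d: merge sources whose active sets coincide.
            representative = {}
            A = []
            for v in survivors:
                rep = representative.setdefault(S[v], v)
                if rep == v:
                    A.append(v)
                else:
                    C[rep].extend(C[v])
    # S3: average over the trials.
    return {v: [s / M for s in sigma[v]] for v in candidates}
```

Once two sources are merged, only the representative is propagated, while every member of its class still receives the accumulated set size. That sharing is where the saving comes from.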

Basically, by introducing step S2-2d and reducing the size of $A(t)$, the proposed method
attempts to improve the computational efficiency in comparison to the original BP
method. For the proposed method, it is important to implement the equivalence
check process in step S2-2d efficiently. In our implementation, we first classify each $v \in A(t)$
according to the value of $n = |S(H \cup \{v\}, t)|$, and then perform the equivalence check
process only for those nodes with the same $n$ value.
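
The size-based classification described above might look like the following sketch. The function name `merge_equivalent` and the dictionary `active_sets` (source node to its current active set) are hypothetical. The point is only that two sets are compared element-wise when they already have the same cardinality.

```python
from collections import defaultdict

def merge_equivalent(active_sets):
    """Group source nodes whose active sets are identical.

    Sets are first bucketed by size n, and equality is tested only
    inside a bucket, since sets of different size cannot be equal.
    """
    by_size = defaultdict(list)
    for v, s in active_sets.items():
        by_size[len(s)].append(v)
    classes = []
    for bucket in by_size.values():
        groups = []  # list of (representative, members) pairs
        for v in bucket:
            for rep, members in groups:
                if active_sets[rep] == active_sets[v]:
                    members.append(v)
                    break
            else:
                groups.append((v, [v]))
        classes.extend(members for _, members in groups)
    return classes
```

Each returned class corresponds to one merged set $C(v)$, with its first element acting as the representative kept in $A(t)$.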

4.5  Burnout  Method 

In order to further improve the computational efficiency of the BP with pruning method,
we additionally introduce a burnout technique and propose a method referred to as
the BP with pruning and burnout method. More specifically, we focus on the fact that
maximizing the marginal influence degree $\sigma(H \cup \{v\}, t)$ with respect to $v \in V \setminus H$ is
equivalent to maximizing the marginal influence gain $\phi_H(v, t) = \sigma(H \cup \{v\}, t) - \sigma(H, t)$.
Here, in the course of the BP process for a newly added information source node $v$,
maximizing $\phi_H(v, t)$ reduces to maximizing $|S(H \cup \{v\}, t) \setminus S(H, t)|$ on average. The BP
with pruning and burnout method estimates $\phi_H$ by the following algorithm:

BP with pruning and burnout method:

C1. Set $\phi_H(v, t) \leftarrow 0$ for each $v \in V \setminus H$ and $t \in \{1, \ldots, T\}$.

C2. Repeat the following procedure $M$ times:

C2-1. Initialize $S(H; 0) = H$, and $S(\{v\}; 0) = \{v\}$ for each $v \in V \setminus H$, and set $A(0) \leftarrow V \setminus H$, $A(1) \leftarrow \emptyset$, $\ldots$, $A(T) \leftarrow \emptyset$, and $C(v) \leftarrow \{v\}$ for each $v \in V \setminus H$.

C2-2. For $t = 1$ to $T$ do the following steps:

C2-2a. Compute $B(t-1) = \bigcup_{v \in A(t-1)} S(\{v\}, t-1) \cup S(H, t-1)$.

C2-2b. Perform the BP process for the links from $B(t-1)$ in $G^T$, and generate the graph $G_t$ constructed by the occupied links.

C2-2c. Compute $S(H, t) = \bigcup_{w \in S(H, t-1)} \Gamma(w; G_t)$, and for each $v \in A(t-1)$, compute $S(\{v\}, t) = \bigcup_{w \in S(\{v\}, t-1)} \Gamma(w; G_t) \setminus S(H, t)$, set $A(t) \leftarrow A(t) \cup \{v\}$ if $S(\{v\}, t) \neq \emptyset$, and set $\phi_H(u, t) \leftarrow \phi_H(u, t) + |S(\{v\}, t)|$ for each $u \in C(v)$.

C2-2d. Check whether $S(\{u\}, t) = S(\{v\}, t)$ for $u, v \in A(t)$, and set $C(v) \leftarrow C(v) \cup C(u)$ and $A(t) \leftarrow A(t) \setminus \{u\}$ if $S(\{u\}, t) = S(\{v\}, t)$.

C3. For each $v \in V \setminus H$ and $t \in \{1, \ldots, T\}$, set $\phi_H(v, t) \leftarrow \phi_H(v, t)/M$, and output $\phi_H(v, t)$.

Intuitively, compared with the BP with pruning method, the burnout technique lets us
substantially reduce the size of the active node set from $S(H \cup \{v\}, t)$ to $S(\{v\}, t)$
for each $v \in V \setminus H$ and $t \in \{1, \ldots, T\}$. Namely, in terms of the computational costs described
by Equation (3), we can expect to obtain smaller numbers for $N_a$ and $N_c$ when $H \neq \emptyset$.
However, how effectively the proposed method works will depend on several conditions
such as the network structure, the time span, and the values of the diffusion probabilities. We
will do a simple analysis later and experimentally show that it is indeed effective.


5  Experimental  Evaluation 

Owing to space limitations, we report our evaluation results only for the final-time
maximization problem.

5.1  Network  Data  and  Settings 

In  our  experiments,  we  employed  two  datasets  of  large  real  networks  used  in  [10],  which 
exhibit  many  of  the  key  features  of  social  networks. 

The first one is a trackback network of Japanese blogs. The network data was collected
by tracing the trackbacks from one blog in the site "goo (http://blog.goo.ne.jp/)"
in May 2005. We refer to the network data as the blog network. The blog network was
a strongly connected bidirectional network, where a link created by a trackback was
regarded as a bidirectional link since blog authors establish mutual communications
by putting trackbacks on each other's blogs. The blog network had 12,047 nodes and
79,920 directed links.

The second one is a network of people that was derived from the "list of people"
within Japanese Wikipedia. We refer to the network data as the Wikipedia network. The
Wikipedia network was also a strongly connected bidirectional network, and had 9,481
nodes and 245,044 directed links.

For the SIS model, we assigned a uniform probability $p$ to the propagation probability
$p_{u,v}$ for any link $(u, v) \in E$, that is, $p_{u,v} = p$. Following [4, 2], we set the value
of $p$ relatively small. In particular, we set $p$ to a value smaller than $1/\bar{d}$,
where $\bar{d}$ is the mean out-degree of a network. Since the values of $\bar{d}$ were about 6.63 and
25.85 for the blog and the Wikipedia networks, respectively, the corresponding values
of $1/\bar{d}$ were about 0.15 and 0.04. We decided to set $p = 0.1$ for the blog network and
$p = 0.01$ for the Wikipedia network. Also, for the time span $T$, we set $T = 30$.
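
As a quick check of this setting, the mean out-degree can be recomputed from the node and link counts reported above. The helper name `mean_out_degree` is hypothetical; the counts and the chosen values of $p$ are those stated in the text.

```python
def mean_out_degree(num_nodes, num_directed_links):
    """Mean out-degree of a directed network: |E| / |V|."""
    return num_directed_links / num_nodes

# (name, number of nodes, number of directed links, chosen p)
networks = [("blog", 12047, 79920, 0.1), ("Wikipedia", 9481, 245044, 0.01)]

for name, n, m, p in networks:
    d = mean_out_degree(n, m)
    print(f"{name}: mean out-degree = {d:.2f}, 1/d = {1 / d:.3f}, p < 1/d: {p < 1 / d}")
```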

For the bond percolation method, we need to specify the number $M$ of times the
bond percolation process is performed. Following [12], we set $M = 10,000$ for estimating
influence degrees for the blog and Wikipedia networks.

All  our  experimentation  was  undertaken  on  a  single  PC  with  an  Intel  Dual  Core 
Xeon  X5272  3.4GHz  processor,  with  32GB  of  memory,  running  under  Linux. 

5.2  Comparison  Methods 

First, we compared the proposed method with three heuristics from social network analysis
with respect to the solution quality. They are based on the notions of "degree centrality",
"closeness centrality", and "betweenness centrality" that are commonly used as
influence measures in sociology [13]. Here, the betweenness of node $v$ is defined as the
total number of shortest paths between pairs of nodes that pass through $v$, the closeness
of node $v$ is defined as the reciprocal of the average distance between $v$ and other nodes
in the network, and the degree of node $v$ is defined as the number of links attached to $v$.
Namely, we employed the methods of choosing nodes in decreasing order of these centralities.
We refer to these methods as the betweenness method, the closeness method,
and the degree method, respectively.
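
To make two of these heuristics concrete, the sketch below computes degree and closeness centrality on an adjacency dictionary and picks the top-ranked nodes. All function names and the toy graph are hypothetical; betweenness is omitted because it needs a longer shortest-path counting routine. Closeness is computed over reachable nodes only, which covers all nodes in a strongly connected network.

```python
from collections import deque

def degree_centrality(adj):
    """Number of links attached to each node (out-links in `adj`)."""
    return {v: len(neighbors) for v, neighbors in adj.items()}

def closeness_centrality(adj):
    """Reciprocal of the average BFS distance from each node to the others."""
    scores = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total > 0 else 0.0
    return scores

def top_k(scores, k):
    """Choose k nodes in decreasing order of centrality (ties broken by node id)."""
    return sorted(scores, key=lambda v: (-scores[v], v))[:k]

# Toy bidirectional path 0 - 1 - 2.
toy = {0: [1], 1: [0, 2], 2: [1]}
print(top_k(degree_centrality(toy), 1), top_k(closeness_centrality(toy), 1))
```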

Next,  to  evaluate  the  effectiveness  of  the  pruning  and  the  burnout  strategies,  we 
compared  the  proposed  method  with  the  naive  greedy  method  based  on  the  BP  method 
with  respect  to  the  processing  time.  Hereafter,  we  refer  to  the  naive  greedy  method 
based  on  the  BP  method  as  the  BP  method  for  short. 

5.3  Solution  Quality  Comparison 

We first compared the quality of the solution $H_K$ of the proposed method with that
of the betweenness, the closeness, and the degree methods for solving the problem of
influence maximization at the final time step $T$. Clearly, the quality of $H_K$ can be
evaluated by the influence degree $\sigma(H_K, T)$. We estimated the value of $\sigma(H_K, T)$ by
using the bond percolation method with $M = 10,000$ according to [12].

Figures 1 and 2 show the influence degree $\sigma(H_K, T)$ as a function of the number of
initial active nodes $K$ for the blog and the Wikipedia networks, respectively. In the figures,
the circles, triangles, diamonds, and squares indicate the results for the proposed,
the betweenness, the closeness, and the degree methods, respectively. The proposed
method performs the best for both networks, while the betweenness method follows for
the blog dataset and the degree method follows for the Wikipedia dataset. Note that
how each of the conventional heuristics performs depends on the characteristics of the
network structure. These results imply that the proposed method works effectively, and
outperforms the conventional heuristics from social network analysis.

Fig. 1. Comparison of solution quality for the blog network (horizontal axis: number of initial active nodes).


It is interesting to note that the $k$ nodes ($k = 1, 2, \ldots, K$) that are discovered to be
most influential by the proposed method are substantially different from those that are
found by the conventional centrality-based heuristic methods. For example, the best
node ($k = 1$) chosen by the proposed method for the blog dataset is ranked 118 for the
betweenness method, 659 for the closeness method and 6 for the degree method, and
the 15th node ($k = 15$) by the proposed method is ranked 1373, 8848 and 507 for the
corresponding conventional methods, respectively. The best node ($k = 1$) chosen by the
proposed method for the Wikipedia dataset is ranked 580 for the betweenness method,
2766 for the closeness method and 15 for the degree method, and the 15th node ($k = 15$)
by the proposed method is ranked 265, 2041, and 21 for the corresponding conventional
methods, respectively. It is hard to find a correlation between these rankings, but for
smaller $k$, the degree centrality measure appears to be better than the other centrality
measures, which can be inferred from Figures 1 and 2.

Fig. 2. Comparison of solution quality for the Wikipedia network.


5.4 Processing Time Comparison

Next, we compared the processing time of the proposed method (BP with pruning and
burnout method) with that of the BP method. Let $t(K, T)$ denote the processing time of
a method for solving the problem of influence maximization at the final time step $T$,
where $K$ is the number of initial active nodes. Figures 3 and 4 show the processing time
difference $\Delta t(K, T) = t(K, T) - t(K-1, T)$ as a function of the number of initial active
nodes $K$ for the blog and the Wikipedia networks, respectively. In these figures, the circles
and crosses indicate the results for the proposed and the BP methods, respectively.
Note that $\Delta t(K, T)$ decreases as $K$ increases for the proposed method, whereas it
increases for the BP method. This means that the difference in the total processing time
becomes increasingly larger as $K$ increases. In the case of the blog dataset, the total
processing time for $K = 5$ is about 2 hours for the proposed method and 100 hours for the
BP method. Namely, the proposed method is about 50 times faster than the BP method
for $K = 5$. A similar result holds for the Wikipedia dataset: the total processing time for
$K = 5$ is about 0.5 hours for the proposed method and 9 hours for the BP method, so the
proposed method is about 18 times faster than the BP method for $K = 5$. These results
confirm that the proposed method is much more efficient than the BP method, and can
be practical.


Fig. 3. Comparison of processing time for the blog network.


6 Discussion

The influence function $\sigma(\cdot, T)$ is submodular [4]. For solving a combinatorial optimization
problem of a submodular function $f$ on $V$ by the greedy algorithm, Leskovec et
al. [7] have recently presented a lazy evaluation method that leads to far fewer (expensive)
evaluations of the marginal increments $f(H \cup \{v\}) - f(H)$ $(v \in V \setminus H)$ in the
greedy algorithm for $H \neq \emptyset$, and achieved an improvement in speed. Note here that their
method requires evaluating $f(v)$ for all $v \in V$ at least. Thus, we can apply their method
to the influence maximization problem for the SIS model, where the influence function
$\sigma(\cdot, T)$ is evaluated by simulating the corresponding random process. It is clear that (1)
this method is more efficient than the naive greedy method that does not employ the
BP method and instead evaluates the influence degrees by simulating the diffusion
phenomena, and (2) both methods become the same for $K = 1$ and empirically
estimate the influence function $\sigma(\cdot, T)$ by probabilistic simulations. These methods also
require $M$ to be specified in advance as a parameter, where $M$ is the number of simulations.
Note that the BP and the simulation methods can estimate the influence degree $\sigma(v, t)$
with the same accuracy by using the same value of $M$ (see [12]). Moreover, as shown
in [12], estimating the influence function $\sigma(\cdot, 30)$ by 10,000 simulations needed more than
35.8 hours for the blog dataset and 7.6 hours for the Wikipedia dataset.
However, the proposed method for $K = 30$ needed less than 7.0 hours for the blog
dataset and 3.2 hours for the Wikipedia dataset. Therefore, it is clear that
the proposed method can be faster than the method by Leskovec et al. [7] for the influence
maximization problem for the SIS model.
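
For readers unfamiliar with lazy evaluation, the following is a minimal sketch of the general idea for a monotone submodular set function. It is not the code of [7]: the function name `lazy_greedy`, the evaluation counter, and the toy coverage function are illustrative assumptions. Note that the first round still evaluates $f$ on every singleton, which is the cost discussed above.

```python
import heapq

def lazy_greedy(ground_set, f, K):
    """Greedy maximization of a monotone submodular f with lazy evaluation.

    Returns the selected elements and the number of marginal-gain evaluations.
    """
    selected = []
    current = f(frozenset())
    # Heap entries: (negated stale gain, element, round in which it was computed).
    heap = [(-(f(frozenset([v])) - current), v, 0) for v in ground_set]
    heapq.heapify(heap)
    evaluations = len(heap)
    while heap and len(selected) < K:
        neg_gain, v, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # Gain is up to date; by submodularity no stale entry can beat it.
            selected.append(v)
            current += -neg_gain
        else:
            # Re-evaluate only the current top candidate and push it back.
            gain = f(frozenset(selected) | {v}) - current
            evaluations += 1
            heapq.heappush(heap, (-gain, v, len(selected)))
    return selected, evaluations

# Toy coverage function: f(S) is the number of items covered by the chosen sets.
cover = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6, 7}, 3: {1}}
f = lambda S: len(set().union(*(cover[v] for v in S)))
print(lazy_greedy(list(cover), f, 2))
```

In the toy example the second round re-evaluates a single candidate instead of all three remaining ones.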


Fig. 4. Comparison of processing time for the Wikipedia network.


7 Conclusion

Finding influential nodes is one of the most central problems in the field of social network
analysis. There are several models that simulate how various things, e.g., news,
rumors, diseases, innovation, and ideas, diffuse across the network. One such realistic
model is the susceptible/infected/susceptible (SIS) model, an information diffusion
model where nodes are allowed to be activated multiple times. The computational
complexity drastically increases because of this multiple activation property, e.g., compared
with the susceptible/infected/recovered (SIR) model, where once-activated nodes can
never be deactivated or reactivated. We addressed the problem of efficiently discovering
the influential nodes under the SIS model, i.e., estimating the expected number of
activated nodes at time-step $t$ for $t = 1, \ldots, T$ starting from an initially activated node set
$H \subset V$ at time-step $t = 0$. We solved this problem by constructing a layered graph from
the original social network, adding each layer on top of the existing layers as
time proceeds, and applying bond percolation with a pruning strategy. We showed
by a theoretical analysis that the computational complexity of the proposed method is much
smaller than that of the conventional naive probabilistic simulation method. We applied
the proposed method to two different types of influence maximization problem,
i.e., discovering the $K$ most influential nodes that together maximize the expected influence
degree at the time of interest or the expected influence degree over the time span
of interest. Both problems are solved by the greedy algorithm, taking advantage of the
submodularity of the objective function. By applying the method to two real-world
networks taken from blog and Wikipedia data, we confirmed that the proposed method can achieve
a considerable reduction of computation time without degrading the accuracy compared
with the naive simulation method, and can discover nodes that are more influential than the
nodes identified by the conventional methods based on the various centrality measures.


Abstract. We address the problem of estimating the parameters for a continuous
time delay independent cascade (CTIC) model, a more realistic model for
information diffusion in complex social networks, from the observed information
diffusion data. For this purpose we formulate the rigorous likelihood to obtain
the observed data and propose an iterative method to obtain the parameters
(time-delay and diffusion) by maximizing this likelihood. We apply this method first to
the problem of ranking influential nodes using the network structure taken from
two real-world web datasets and show that the proposed method can predict the
high-ranked influential nodes much more accurately than four well-studied
conventional heuristic methods, and second to the problem of evaluating how
different topics propagate in different ways using real-world blog data and show
that there are indeed differences in the propagation speed among different topics.


1  Introduction 

The rise of the Internet and the World Wide Web accelerates the creation of various
large-scale social networks, and considerable attention has been brought to social
networks as an important medium for the spread of information [1-5]. Innovation, topics
and even malicious rumors can propagate through social networks in the form of
so-called "word-of-mouth" communications. This forms a virtual society comprising various
kinds of communities. Just as in a real-world society, some communities grow rapidly
and others shrink. Likewise, some information propagates quickly and some
only slowly. Good things remain and bad things diminish as if there were a natural
selection. The social network offers a nice platform to study the mechanism of society
dynamics and the behavior of humans, each as a member of the society. In this paper, we
address the problem of how information diffuses through the social network, in particular
how different topics propagate differently, by inducing a diffusion model that can
handle continuous time delay.

There are several models that simulate information diffusion through a network.
A widely used model is the independent cascade (IC) model, a fundamental probabilistic
model of information diffusion [6, 7], which can be regarded as the so-called
susceptible/infected/recovered (SIR) model for the spread of a disease [2]. This model has been
used to solve such problems as the influence maximization problem, which is to find a
limited number of nodes that are influential for the spread of information [7, 8], and the
influence minimization problem, which is to suppress the spread of undesirable information
by blocking a limited number of links [9]. The IC model requires the parameters
that represent diffusion probabilities through links to be specified in advance. Since
the true values of the parameters are not available in practice, this poses yet another
problem of estimating them from the observed data [10].

One of the drawbacks of the IC model is that it cannot handle time-delays for information
propagation, and we need a model that explicitly represents time delay. Gruhl et al.
were the first to extend the IC model to include the time-delay [3]. Their model has
parameters that represent time-delays through links as well as parameters that
represent diffusion probabilities through links. They presented a method for estimating
the parameter values from the observed data using an EM-like algorithm, and experimentally
showed its effectiveness using sparse Erdős-Rényi networks. However, it is not
clear what they are optimizing in deriving the update formulas of the parameter values.
Further, they treated time as a discrete variable, which means that
information is assumed to propagate in a synchronized way, in the sense that each node can be activated
only at a specific time. In reality, time flows continuously and thus information, too,
propagates on this continuous time axis. For any node, information must be able to be
received at any time from other nodes and must be allowed to propagate to yet other
nodes at any other time, both in an asynchronous way. Thus, for a realistic behavioral
analysis of information diffusion, we need to adopt a model that explicitly represents
continuous time delay.

In this paper, we deal with an information diffusion model that incorporates continuous
time delay based on the IC model (referred to as the CTIC model), and propose
a novel method for estimating the values of the parameters in the model from a set of
information diffusion results that are observed as time-sequences of infected (active)
nodes. What makes this problem difficult is that incorporating time-delay makes the
time-sequence observation data structural. There is no way of knowing from the data
which node activated which other node that comes later in the sequence. We introduce
an objective function that rigorously represents the likelihood of obtaining such observed
data sequences under the CTIC model on a given network, and derive an iterative
algorithm by which the objective function is maximized. First, we test the convergence
performance of the proposed method by applying it to the problem of ranking influential
nodes using the network structure taken from two real-world web datasets, and show
that the parameters converge to the correct values by the iterative procedure and that the method can
predict the high-ranked influential nodes much more accurately than four well-studied
conventional heuristic methods. Second, we apply the method to the problem of behavioral
analysis of topic propagation, i.e., evaluating how different topics propagate in
different ways, using real-world blog data, and show that there are indeed differences
in the propagation speed among different topics.


2  Information  Diffusion  Model  and  Learning  Problem 

We  first  define  the  IC  model  according  to  [7],  and  then  introduce  the  continuous-time 
IC  model.  After  that,  we  formulate  our  learning  problem. 

We mathematically model the spread of information through a directed network
$G = (V, E)$ without self-links, where $V$ and $E$ $(\subset V \times V)$ stand for the sets of all the
nodes and links, respectively. We call nodes active if they have been influenced with
the information. In the model, it is assumed that nodes can switch their states only from
inactive to active, but not from active to inactive. Given an initial set $S$ of active nodes,
we assume that the nodes in $S$ have first become active at an initial time, and all the
other nodes are inactive at that time.

In this paper, node $u$ is called a child node of node $v$ if $(v, u) \in E$, and node $u$ is
called a parent node of node $v$ if $(u, v) \in E$. For each node $v \in V$, let $F(v)$ and $B(v)$
denote the set of child nodes of $v$ and the set of parent nodes of $v$, respectively,

$F(v) = \{w \in V;\ (v, w) \in E\}, \quad B(v) = \{u \in V;\ (u, v) \in E\}.$

2.1  Independent  Cascade  Model 

Let us describe the definition of the IC model. In this model, for each link $(u, v)$, we
specify a real value $\lambda_{u,v}$ with $0 < \lambda_{u,v} < 1$ in advance. Here $\lambda_{u,v}$ is referred to as the
diffusion probability through link $(u, v)$.

The diffusion process unfolds in discrete time-steps $t \geq 0$, and proceeds from a
given initial active set $S$ in the following way. When a node $u$ becomes active at time-step
$t$, it is given a single chance to activate each currently inactive child node $v$, and
succeeds with probability $\lambda_{u,v}$. If $u$ succeeds, then $v$ will become active at time-step $t + 1$.
If multiple parent nodes of $v$ become active at time-step $t$, then their activation attempts
are sequenced in an arbitrary order, but all performed at time-step $t$. Whether or not $u$
succeeds, it cannot make any further attempts to activate $v$ in subsequent rounds. The
process terminates if no more activations are possible.
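
The IC process just described can be simulated with a short routine such as the one below. This is a sketch under assumptions: the function name `simulate_ic`, the adjacency dictionary `children`, and the dictionary `prob` keyed by links are hypothetical names, and the arbitrary ordering of simultaneous attempts is simply the iteration order.

```python
import random

def simulate_ic(children, prob, seeds, rng=None):
    """Run one IC cascade; return {node: time-step at which it became active}."""
    rng = rng or random.Random()
    activated_at = {s: 0 for s in seeds}
    frontier = list(seeds)      # nodes that became active at time-step t
    t = 0
    while frontier:             # terminates when no more activations are possible
        next_frontier = []
        for u in frontier:
            for v in children.get(u, []):
                # u gets a single chance on each child v that is still inactive.
                if v not in activated_at and rng.random() < prob[(u, v)]:
                    activated_at[v] = t + 1
                    next_frontier.append(v)
        frontier = next_frontier
        t += 1
    return activated_at

# Toy chain 0 -> 1 -> 2 with diffusion probability 0.5 on both links.
chain = {0: [1], 1: [2]}
probs = {(0, 1): 0.5, (1, 2): 0.5}
print(simulate_ic(chain, probs, [0], random.Random(1)))
```

Because only newly activated nodes enter the frontier, each link is tried at most once, which matches the single-chance rule.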

2.2  Continuous-Time  Independent  Cascade  Model 

Next, we extend the IC model so as to allow continuous-time delays, and refer to the
extended model as the continuous-time independent cascade (CTIC) model.

In the CTIC model, for each link $(u, v) \in E$, we specify real values $r_{u,v}$ and $\kappa_{u,v}$
with $r_{u,v} > 0$ and $0 < \kappa_{u,v} < 1$ in advance. We refer to $r_{u,v}$ and $\kappa_{u,v}$ as the time-delay
parameter and the diffusion parameter through link $(u, v)$, respectively.

The diffusion process unfolds in continuous time $t$, and proceeds from a given initial
active set $S$ in the following way. Suppose that a node $u$ becomes active at time $t$. Then,
node $u$ is given a single chance to activate each currently inactive child node $v$. We
choose a delay-time $\delta$ from the exponential distribution with parameter $r_{u,v}$. If node $v$
is not active before time $t + \delta$, then node $u$ attempts to activate node $v$, and succeeds
with probability $\kappa_{u,v}$. If $u$ succeeds, then $v$ will become active at time $t + \delta$. Under
the continuous-time framework, it is unlikely that multiple parent nodes of $v$ attempt
to activate $v$ at the same time $t + \delta$. But if they do, their activation attempts
are sequenced in an arbitrary order. Whether or not $u$ succeeds, it cannot make any
further attempts to activate $v$ in subsequent rounds. The process terminates if no more
activations are possible.

For an initial active set $S$, let $\varphi(S)$ denote the number of active nodes at the end of
the random process for the CTIC model. Note that $\varphi(S)$ is a random variable. Let $\sigma(S)$
denote the expected value of $\varphi(S)$. We call $\sigma(S)$ the influence degree of $S$ for the CTIC
model.
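
A single run of the CTIC process can be sketched with an event queue ordered by time. The names `simulate_ctic`, `children`, `r`, and `kappa` are hypothetical, with `r` and `kappa` as dictionaries keyed by links. The sketch draws the success of an attempt before its delay; since the two draws are independent, this should give the same distribution as drawing the delay first.

```python
import heapq
import random

def simulate_ctic(children, r, kappa, seeds, rng=None):
    """Run one CTIC cascade; return {node: activation time}."""
    rng = rng or random.Random()
    activated_at = {}
    events = [(0.0, s) for s in seeds]   # (activation time, node)
    heapq.heapify(events)
    while events:
        t, u = heapq.heappop(events)
        if u in activated_at:
            continue  # u was already active before this attempt arrived
        activated_at[u] = t
        for v in children.get(u, []):
            if v in activated_at:
                continue
            # Single chance per link: succeed with probability kappa,
            # arriving after an exponentially distributed delay with rate r.
            if rng.random() < kappa[(u, v)]:
                delay = rng.expovariate(r[(u, v)])
                heapq.heappush(events, (t + delay, v))
    return activated_at

def estimate_influence(children, r, kappa, seeds, runs=1000, seed=0):
    """Monte Carlo estimate of sigma(S): the mean number of active nodes."""
    rng = random.Random(seed)
    total = sum(len(simulate_ctic(children, r, kappa, seeds, rng)) for _ in range(runs))
    return total / runs
```

Here `len(simulate_ctic(...))` plays the role of $\varphi(S)$ for one run, and `estimate_influence` averages it to approximate $\sigma(S)$.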

2.3  Learning  problem 

For the CTIC model on network $G$, we define the time-delay parameter vector $\boldsymbol{r}$ and the
diffusion parameter vector $\boldsymbol{\kappa}$ by

$\boldsymbol{r} = (r_{u,v})_{(u,v) \in E}, \quad \boldsymbol{\kappa} = (\kappa_{u,v})_{(u,v) \in E}.$

In practice, the true values of $\boldsymbol{r}$ and $\boldsymbol{\kappa}$ are not available. Thus, we must estimate them
from past information diffusion histories observed as sets of active nodes.

We consider an observed data set of $M$ independent information diffusion results,

$\mathcal{D}_M = \{D_m;\ m = 1, \ldots, M\}.$

Here, each $D_m$ is a time-sequence of active nodes in the $m$th information diffusion result,

$D_m = \langle D_m(t);\ t \in \mathcal{T}_m \rangle, \quad \mathcal{T}_m = \langle t_m, \ldots, T_m \rangle,$

where $D_m(t)$ is the set of all the nodes that have first become active at time $t$, and $\mathcal{T}_m$
is the observation-time list; $t_m$ is the observed initial time and $T_m$ is the observed final
time. We assume that for any active node $v$ in the $m$th information diffusion result, there
exists some $t \in \mathcal{T}_m$ such that $v \in D_m(t)$. Let $t_{m,v}$ denote the time at which node $v$ becomes
active in the $m$th information diffusion result, i.e., $v \in D_m(t_{m,v})$. For any $t \in \mathcal{T}_m$, we set

$C_m(t) = \bigcup_{\tau \in \mathcal{T}_m,\ \tau < t} D_m(\tau).$

Note that $C_m(t)$ is the set of active nodes before time $t$ in the $m$th information diffusion
result. We also interpret $D_m$ as referring to the set of all the active nodes in the $m$th
information diffusion result for convenience. In this paper, we consider the problem
of estimating the values of $\boldsymbol{r}$ and $\boldsymbol{\kappa}$ from $\mathcal{D}_M$.

3  Proposed  Method 

We explain how we estimate the values of $\boldsymbol{r}$ and $\boldsymbol{\kappa}$ from $\mathcal{D}_M$. Here, we only
outline the derivation of the proposed method for lack of space. We also briefly
mention how we do behavioral analysis with the method.

3.1 Likelihood function


For the learning problem described above, we rigorously derive the likelihood function
$\mathcal{L}(\boldsymbol{r}, \boldsymbol{\kappa}; \mathcal{D}_M)$ with respect to $\boldsymbol{r}$ and $\boldsymbol{\kappa}$ to use as our objective function.

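First, we consider any node $v \in D_m$ with $t_{m,v} > 0$ for the $m$th information diffusion
result. Let $\mathcal{A}_{m,u,v}$ denote the probability density that a node $u \in B(v) \cap C_m(t_{m,v})$ activates
the node $v$ at time $t_{m,v}$, that is,

$$\mathcal{A}_{m,u,v} = \kappa_{u,v}\, r_{u,v} \exp(-r_{u,v}(t_{m,v} - t_{m,u})). \tag{1}$$

Let $\mathcal{B}_{m,u,v}$ denote the probability that the node $v$ is not activated from a node $u \in B(v) \cap
C_m(t_{m,v})$ within the time-period $[t_{m,u}, t_{m,v}]$, that is,

$$\mathcal{B}_{m,u,v} = 1 - \kappa_{u,v} \int_{t_{m,u}}^{t_{m,v}} r_{u,v} \exp(-r_{u,v}(t - t_{m,u}))\, dt = \kappa_{u,v} \exp(-r_{u,v}(t_{m,v} - t_{m,u})) + (1 - \kappa_{u,v}). \tag{2}$$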


If there exist multiple active parents for the node $v$, i.e., $\eta = |B(v) \cap C_m(t_{m,v})| > 1$, we
need to consider the possibilities that each parent node succeeds in activating $v$ at time $t_{m,v}$.
However, in the case of the continuous time delay model, we can ignore simultaneous
activations by multiple active parents due to the continuous property. Thus, the probability
density that the node $v$ is activated at time $t_{m,v}$, denoted by $h_{m,v}$, can be expressed as

$$h_{m,v} = \sum_{u \in B(v) \cap C_m(t_{m,v})} \mathcal{A}_{m,u,v} \prod_{x \in B(v) \cap C_m(t_{m,v}) \setminus \{u\}} \mathcal{B}_{m,x,v} = \prod_{x \in B(v) \cap C_m(t_{m,v})} \mathcal{B}_{m,x,v} \sum_{u \in B(v) \cap C_m(t_{m,v})} \mathcal{A}_{m,u,v} (\mathcal{B}_{m,u,v})^{-1}. \tag{3}$$


Note  that  we  are  not  able  to  know  which  node  u  actually  activated  the  node  v.  This  can 
be  regarded  as  a  hidden  structure. 
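Equations (1) to (3) can be evaluated directly for a single node. The following is a minimal sketch, not the authors' implementation: the function names and the representation of active parents as `(t_u, kappa_uv, r_uv)` tuples are assumptions made for illustration.

```python
import math

def density_a(kappa, r, delay):
    """Equation (1): density that a parent activates v after the given delay."""
    return kappa * r * math.exp(-r * delay)

def survival_b(kappa, r, delay):
    """Equation (2): probability that the parent has not activated v within the delay."""
    return kappa * math.exp(-r * delay) + (1.0 - kappa)

def activation_density(t_v, parents):
    """Equation (3), second form. parents: list of (t_u, kappa_uv, r_uv) with t_u < t_v."""
    a = [density_a(k, r, t_v - t_u) for t_u, k, r in parents]
    b = [survival_b(k, r, t_v - t_u) for t_u, k, r in parents]
    return math.prod(b) * sum(ai / bi for ai, bi in zip(a, b))

# Illustrative values only: two active parents with different link parameters.
h = activation_density(2.0, [(0.0, 0.3, 1.0), (1.5, 0.5, 2.0)])
```

The second form of equation (3) is used because it needs only one product over the parents, rather than one product per parent.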

Next, for the $m$th information diffusion result, we consider any link $(v, w) \in E$ such
that $v \in C_m(T_m)$ and $w \notin D_m$. Let $g_{m,v,w}$ denote the probability that the node $w$ is not
activated by the node $v$ within the observed time period $[t_{m,v}, T_m]$. We can easily derive
the following equation:

$$g_{m,v,w} = \kappa_{v,w} \exp(-r_{v,w}(T_m - t_{m,v})) + (1 - \kappa_{v,w}). \tag{4}$$

Here we can naturally assume that each information diffusion process finished sufficiently
earlier than the observed final time, i.e., $T_m \gg \max\{t;\ D_m(t) \neq \emptyset\}$. Thus, as
$T_m \to \infty$ in equation (4), we assume

$$g_{m,v,w} = 1 - \kappa_{v,w}. \tag{5}$$


Therefore, by using equations (3), (5), and the independence properties, we can
define the likelihood function $\mathcal{L}(r, \kappa; \mathcal{D}_M)$ with respect to $r$ and $\kappa$ by

$$\mathcal{L}(r, \kappa; \mathcal{D}_M) = \prod_{m=1}^{M} \left( \prod_{t \in \mathcal{T}_m} \prod_{v \in D_m(t)} h_{m,v} \prod_{v \in D_m} \prod_{w \in F(v) \setminus D_m} g_{m,v,w} \right). \tag{6}$$

Here, we retained the product with respect to $v \in D_m(t)$ for completeness, but in practice
there is only one $v$ in $D_m(t)$.

In this paper, we focus on the above situation (i.e., equation (5)) for simplicity, but
we can easily modify our method to cope with the general one (i.e., equation (4)). Thus,
our problem is to obtain the values of $r$ and $\kappa$ that maximize equation (6). For this
estimation problem, we derive an iterative algorithm in order to obtain its solution
stably.


3.2  Estimation  method 


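We describe our estimation method. Let $\bar{r} = (\bar{r}_{u,v})$ and $\bar{\kappa} = (\bar{\kappa}_{u,v})$ be the current estimates
of $r$ and $\kappa$, respectively. For each $v \in D_m$ and $u \in B(v) \cap C_m(t_{m,v})$, we define $\alpha_{m,u,v}$ by

$$\alpha_{m,u,v} = \frac{\mathcal{A}_{m,u,v} (\mathcal{B}_{m,u,v})^{-1}}{\sum_{x \in B(v) \cap C_m(t_{m,v})} \mathcal{A}_{m,x,v} (\mathcal{B}_{m,x,v})^{-1}}. \tag{7}$$

Let $\bar{\mathcal{A}}_{m,u,v}$, $\bar{\mathcal{B}}_{m,u,v}$, $\bar{h}_{m,v}$, and $\bar{\alpha}_{m,u,v}$ denote the values of $\mathcal{A}_{m,u,v}$, $\mathcal{B}_{m,u,v}$, $h_{m,v}$, and $\alpha_{m,u,v}$
calculated by using $\bar{r}$ and $\bar{\kappa}$, respectively.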

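From equations (3), (5), (6), we can transform our objective function $\mathcal{L}(r, \kappa; \mathcal{D}_M)$
as follows:

$$\log \mathcal{L}(r, \kappa; \mathcal{D}_M) = Q(r, \kappa; \bar{r}, \bar{\kappa}) - H(r, \kappa; \bar{r}, \bar{\kappa}), \tag{8}$$

where $Q(r, \kappa; \bar{r}, \bar{\kappa})$ is defined by

$$Q(r, \kappa; \bar{r}, \bar{\kappa}) = \sum_{m=1}^{M} \left( \sum_{t \in \mathcal{T}_m} \sum_{v \in D_m(t)} Q_{m,v} + \sum_{v \in D_m} \sum_{w \in F(v) \setminus D_m} \log(1 - \kappa_{v,w}) \right),$$

$$Q_{m,v} = \sum_{u \in B(v) \cap C_m(t_{m,v})} \log \mathcal{B}_{m,u,v} + \sum_{u \in B(v) \cap C_m(t_{m,v})} \bar{\alpha}_{m,u,v} \log\left( \mathcal{A}_{m,u,v} (\mathcal{B}_{m,u,v})^{-1} \right), \tag{9}$$

and $H(r, \kappa; \bar{r}, \bar{\kappa})$ is defined by

$$H(r, \kappa; \bar{r}, \bar{\kappa}) = \sum_{m=1}^{M} \sum_{t \in \mathcal{T}_m} \sum_{v \in D_m(t)} \sum_{u \in B(v) \cap C_m(t_{m,v})} \bar{\alpha}_{m,u,v} \log \alpha_{m,u,v}. \tag{10}$$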


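Since $H(r, \kappa; \bar{r}, \bar{\kappa})$ is maximized at $r = \bar{r}$ and $\kappa = \bar{\kappa}$ from equation (10), we can increase
the value of $\mathcal{L}(r, \kappa; \mathcal{D}_M)$ by maximizing $Q(r, \kappa; \bar{r}, \bar{\kappa})$ (see equation (8)). Note here that
although $\log \mathcal{A}_{m,u,v}$ is a linear combination of $\log \kappa_{u,v}$, $\log r_{u,v}$, and $r_{u,v}$, $\log \mathcal{B}_{m,u,v}$ cannot
be written as such a linear combination (see equations (1), (2)). In order to cope with
this problem of $\log \mathcal{B}_{m,u,v}$, we transform $\log \mathcal{B}_{m,u,v}$ in the same way as above, and define
$\beta_{m,u,v}$ by

$$\beta_{m,u,v} = \kappa_{u,v} \exp(-r_{u,v}(t_{m,v} - t_{m,u})) \,/\, \mathcal{B}_{m,u,v}.$$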
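The step from equation (8) to the claim that maximizing $Q$ cannot decrease the likelihood can be spelled out as follows. This is the standard EM-style argument written in the notation above, offered as a sketch rather than as part of the original derivation. For each activated node, the weights $\bar{\alpha}_{m,u,v}$ are non-negative and sum to one over its active parents, so Gibbs' inequality gives $H(r, \kappa; \bar{r}, \bar{\kappa}) \le H(\bar{r}, \bar{\kappa}; \bar{r}, \bar{\kappa})$, and therefore

```latex
\begin{aligned}
\log \mathcal{L}(r, \kappa; \mathcal{D}_M) - \log \mathcal{L}(\bar{r}, \bar{\kappa}; \mathcal{D}_M)
 &= \bigl[ Q(r, \kappa; \bar{r}, \bar{\kappa}) - Q(\bar{r}, \bar{\kappa}; \bar{r}, \bar{\kappa}) \bigr]
  - \bigl[ H(r, \kappa; \bar{r}, \bar{\kappa}) - H(\bar{r}, \bar{\kappa}; \bar{r}, \bar{\kappa}) \bigr] \\
 &\ge Q(r, \kappa; \bar{r}, \bar{\kappa}) - Q(\bar{r}, \bar{\kappa}; \bar{r}, \bar{\kappa}).
\end{aligned}
```

Any choice of $(r, \kappa)$ that increases $Q$ over its value at the current estimates therefore increases the log-likelihood by at least the same amount.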

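Finally, as the solution which maximizes $Q(r, \kappa; \bar{r}, \bar{\kappa})$, we obtain the following update
formulas of our estimation method:

$$r_{u,v} = \frac{\sum_{m \in \mathcal{M}^+_{u,v}} \bar{\alpha}_{m,u,v}}{\sum_{m \in \mathcal{M}^+_{u,v}} \left( \bar{\alpha}_{m,u,v} + (1 - \bar{\alpha}_{m,u,v}) \bar{\beta}_{m,u,v} \right) (t_{m,v} - t_{m,u})},$$

$$\kappa_{u,v} = \frac{1}{|\mathcal{M}^+_{u,v}| + |\mathcal{M}^-_{u,v}|} \sum_{m \in \mathcal{M}^+_{u,v}} \left( \bar{\alpha}_{m,u,v} + (1 - \bar{\alpha}_{m,u,v}) \bar{\beta}_{m,u,v} \right),$$

where $\bar{\beta}_{m,u,v}$ denotes $\beta_{m,u,v}$ evaluated at $\bar{r}$ and $\bar{\kappa}$, and $\mathcal{M}^+_{u,v}$ and $\mathcal{M}^-_{u,v}$ are defined by

$$\mathcal{M}^+_{u,v} = \{ m \in \{1, \ldots, M\};\ u, v \in D_m,\ v \in F(u),\ t_{m,u} < t_{m,v} \},$$

$$\mathcal{M}^-_{u,v} = \{ m \in \{1, \ldots, M\};\ u \in D_m,\ v \notin D_m,\ v \in F(u) \}.$$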

Note that we can regard our estimation method as a kind of EM algorithm. It should
be noted here that the value of the likelihood function never decreases from one
iteration to the next, and the iterative algorithm is guaranteed to converge.
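The update formulas can be sketched in code for the special case where $r$ and $\kappa$ are shared by all links, so that the sums run over every (cascade, link) pair instead of over one link. This is an illustrative sketch under that assumption, not the authors' implementation. The data layout (a graph as `{node: [children]}`, a cascade as `{node: activation_time}`), the function names, and the toy data are all hypothetical.

```python
import math

def parent_index(graph):
    """Invert the child lists F(u) into parent lists B(v)."""
    parents = {}
    for u, children in graph.items():
        for v in children:
            parents.setdefault(v, []).append(u)
    return parents

def active_parent_delays(parents, times, v):
    """Delays t_v - t_u over the parents u of v that were active before v."""
    t_v = times[v]
    return [t_v - times[u] for u in parents.get(v, []) if u in times and times[u] < t_v]

def count_failed_links(graph, times):
    """Number of links (u, w) with u active and w never activated."""
    return sum(1 for u in times for w in graph.get(u, []) if w not in times)

def log_likelihood(graph, cascades, r, kappa):
    """Log of equation (6) with g = 1 - kappa (equation (5)); requires kappa < 1."""
    parents = parent_index(graph)
    total = 0.0
    for times in cascades:
        for v in times:
            delays = active_parent_delays(parents, times, v)
            if delays:  # seed nodes have no earlier active parent
                b = [kappa * math.exp(-r * d) + 1.0 - kappa for d in delays]
                ratio = [kappa * r * math.exp(-r * d) / bi for d, bi in zip(delays, b)]
                total += sum(math.log(bi) for bi in b) + math.log(sum(ratio))
        failed = count_failed_links(graph, times)
        if failed:
            total += failed * math.log(1.0 - kappa)
    return total

def em_step(graph, cascades, r, kappa):
    """One update of the shared parameters (r, kappa)."""
    parents = parent_index(graph)
    sum_alpha = sum_resp = sum_resp_delay = 0.0
    n_links = 0  # accumulates |M+| + |M-| over all links
    for times in cascades:
        for v in times:
            delays = active_parent_delays(parents, times, v)
            if not delays:
                continue
            decay = [math.exp(-r * d) for d in delays]
            b = [kappa * e + 1.0 - kappa for e in decay]
            ratio = [kappa * r * e / bi for e, bi in zip(decay, b)]  # A / B
            norm = sum(ratio)
            for d, e, bi, q in zip(delays, decay, b, ratio):
                alpha = q / norm            # equation (7)
                beta = kappa * e / bi
                resp = alpha + (1.0 - alpha) * beta
                sum_alpha += alpha
                sum_resp += resp
                sum_resp_delay += resp * d
                n_links += 1
        n_links += count_failed_links(graph, times)
    return sum_alpha / sum_resp_delay, sum_resp / n_links

def fit(graph, cascades, r=1.0, kappa=0.5, tol=1e-12, max_iter=1000):
    """Iterate em_step until the summed parameter change falls below tol."""
    for _ in range(max_iter):
        new_r, new_kappa = em_step(graph, cascades, r, kappa)
        done = abs(new_r - r) + abs(new_kappa - kappa) < tol
        r, kappa = new_r, new_kappa
        if done:
            break
    return r, kappa

# Toy data, purely illustrative.
graph = {0: [1, 2], 1: [2, 3], 2: [3], 3: []}
cascades = [{0: 0.0, 1: 0.5, 2: 0.9}, {0: 0.0, 2: 1.2, 3: 1.5}, {1: 0.0, 2: 0.3}, {0: 0.0}]
r_hat, kappa_hat = fit(graph, cascades)
```

Evaluating `log_likelihood` after each call to `em_step` is a simple way to check the non-decreasing property stated above on one's own data.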

3.3  Behavioral  analysis 

Thus far, we assumed that the parameters (time-delay and diffusion) can vary with respect
to links but remain the same irrespective of the topic of information diffused,
following Gruhl et al. [3]. However, they may be sensitive to the topic.

Our method can cope with this by assigning $m$ to a topic, and placing a constraint
that the parameters depend only on topics and not on links throughout the network $G$,
that is, $r_{m,u,v} = r_m$ and $\kappa_{m,u,v} = \kappa_m$ for any link $(u, v) \in E$. This constraint is required
because, without it, we have only one piece of observation for each $(m, u, v)$ and there
is no way to learn the parameters. Noting that we can naturally assume that people
behave quite similarly for the same topic, this constraint should be acceptable. Under
this setting, we can easily obtain the parameter update formulas. Using each pair of the
estimated parameters, $(r_m, \kappa_m)$, we can analyze the behavior of people with respect to
the topics of information by simply plotting $(r_m, \kappa_m)$ as a point in 2-dimensional space.


3.4  Simple  case  analysis 

Here, we analyze a few basic properties of the proposed method under simple settings.
Assume that a node $v$ became active at time $t$ after receiving certain information. We
denote the active parent nodes of $v$ by $u_1, \ldots, u_N$. First, we consider a simple case in which
the diffusion parameter $\kappa$ is 1 for all links, the time-delay parameter $r$ is a constant and the
same for all links, and the activation times of $u_1, \ldots, u_N$ are all zero. Then, as is given
in equation (3), the probability density that the node $v$ is activated at time $t$ by one of
the parent nodes can be expressed as follows:

$$h_v = \sum_{n=1}^{N} r \exp(-rt) \left( 1 - \int_0^t r \exp(-r\tau)\,d\tau \right)^{N-1} = N r \exp(-N r t).$$


Similarly, for the case that the parent nodes $u_1, \ldots, u_N$ became active at times $t_1, \ldots, t_N$
$(< t)$, respectively, we easily obtain the following probability density:

$$h_v = N r \exp\left( -N r \left( t - \frac{1}{N} \sum_{n=1}^{N} t_n \right) \right).$$

The maximum likelihood is attained by maximizing $\log h_v$ with respect to $r$, and the
average delay time is obtained as follows:

$$r^{-1} = N \left( t - \frac{1}{N} \sum_{n=1}^{N} t_n \right).$$

We can see that this estimate is $N$ times larger than the simple average of the time
differences. In other words, the information diffuses more quickly when there exist multiple
active parents: the expected time until activation is $r^{-1}/N$, and this fact matches our
intuition. Thus simple statistics such as the average delay time may fail to capture the
intrinsic property of information diffusion phenomena, and this suggests that an adequate
information diffusion model is vital.
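The factor of $N$ can be checked by simulation. The sketch below assumes the first simple case above ($\kappa = 1$, all parents active at time zero) and draws the activation time of $v$ as the earliest of $N$ independent exponential delays. The function name and the values $N = 4$, $r = 0.5$ are illustrative assumptions.

```python
import random

def mean_first_activation(n_parents, r, trials=100_000, seed=0):
    """Average time until the earliest of n_parents exponential delays with rate r."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(rng.expovariate(r) for _ in range(n_parents))
    return total / trials

n_parents, r = 4, 0.5
observed_mean = mean_first_activation(n_parents, r)
# Treating the observed mean as the delay of a single link underestimates 1/r by a factor of N.
naive_delay = observed_mean
corrected_delay = n_parents * observed_mean
```

With these values the true per-link mean delay is $1/r = 2$, while the observed mean should be close to $1/(Nr) = 0.5$, so only the corrected estimate recovers the per-link delay.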

Next, we consider another simple case in which the diffusion parameter $\kappa$ and the
time-delay parameter $r$ are both uniform and constant for all links, and the activation times
of $u_1, \ldots, u_N$ are all zero. Here both parameters are variables. Then the probability
density that the node $v$ is activated at time $t$ can be expressed as follows:

$$h_v = N \kappa r \exp(-rt) \left( \kappa \exp(-rt) + (1 - \kappa) \right)^{N-1}.$$

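Now, we consider maximizing $f(\kappa, r) = \log h_v$ with respect to $\kappa$ and $r$. The first- and
second-order derivatives of $f(\kappa, r)$ with respect to $\kappa$ are given by

$$\frac{\partial f(\kappa, r)}{\partial \kappa} = \frac{1}{\kappa} + (N - 1) \frac{\exp(-rt) - 1}{\kappa \exp(-rt) + (1 - \kappa)},$$

$$\frac{\partial^2 f(\kappa, r)}{\partial \kappa\, \partial \kappa} = -\frac{1}{\kappa^2} - (N - 1) \left( \frac{\exp(-rt) - 1}{\kappa \exp(-rt) + (1 - \kappa)} \right)^2.$$

Since the above second-order derivative is negative for any given parameter $r$, we
note that there exists a unique global solution for $\kappa$. The corresponding derivatives with
respect to $r$ are given by

$$\frac{\partial f(\kappa, r)}{\partial r} = \frac{1}{r} - t - (N - 1) \frac{t \kappa \exp(-rt)}{\kappa \exp(-rt) + (1 - \kappa)},$$

$$\frac{\partial^2 f(\kappa, r)}{\partial r\, \partial r} = -\frac{1}{r^2} + (N - 1) \frac{t^2 \kappa (1 - \kappa) \exp(-rt)}{\left( \kappa \exp(-rt) + (1 - \kappa) \right)^2}.$$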

Unfortunately, we cannot guarantee that the above second-order derivative is negative.
However, most likely, this value is negative when $r \ll 1$, and can be positive
when $r \gg 1$, in which case the shape of the objective function can be complex. We can
speculate that the convergence is better for a smaller value of $r$. Later, in our experiments,
we empirically evaluate this point by using the method described in 3.1 and 3.2
with $r = 2$ and $r = 0.5$, which are in the range that is widely explored by many existing
studies. Clearly, we need to perform further theoretical and empirical studies because
we are simultaneously estimating both the diffusion and time-delay parameters, $\kappa$ and $r$.
However, the experiments show that our method is stable for the range of parameters
we used, which suggests that the likelihood function has favorable mathematical properties.
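The sign behaviour of the two second-order derivatives can be inspected numerically. The sketch below simply evaluates the closed-form expressions above; the function names and the sample values ($\kappa = 0.5$, $t = 0.5$, $N = 11$) are arbitrary assumptions chosen for illustration and are not taken from the experiments.

```python
import math

def log_density(kappa, r, t, n):
    """f(kappa, r) = log h_v for n parents that all became active at time zero."""
    return (math.log(n * kappa * r) - r * t
            + (n - 1) * math.log(kappa * math.exp(-r * t) + 1.0 - kappa))

def d2f_dkappa2(kappa, r, t, n):
    """Second derivative of f with respect to kappa; never positive."""
    decay = math.exp(-r * t)
    return -1.0 / kappa**2 - (n - 1) * ((decay - 1.0) / (kappa * decay + 1.0 - kappa)) ** 2

def d2f_dr2(kappa, r, t, n):
    """Second derivative of f with respect to r; its sign depends on the arguments."""
    decay = math.exp(-r * t)
    denom = (kappa * decay + 1.0 - kappa) ** 2
    return -1.0 / r**2 + (n - 1) * t**2 * kappa * (1.0 - kappa) * decay / denom

kappa, t, n = 0.5, 0.5, 11
curvature_small_r = d2f_dr2(kappa, 0.1, t, n)   # negative for these values
curvature_large_r = d2f_dr2(kappa, 10.0, t, n)  # positive for these values
```

For these sample values the curvature in $r$ changes sign between $r = 0.1$ and $r = 10$, which illustrates why concavity in $r$ cannot be guaranteed in general.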

4  Experiments  with  Artificial  data 

We evaluated the effectiveness of the proposed learning method using the topologies
of two large real networks. First, we evaluated how accurately it can estimate the
parameters of the CTIC model from $\mathcal{D}_M$. Next, we considered applying our learning
method to the problem of extracting influential nodes, and evaluated how well our
learned model can predict the high ranked influential nodes with respect to influence
degree $\sigma(v)$ $(v \in V)$ for the true CTIC model.

4.1 Experimental Settings


In our experiments, we employed two datasets of large real networks used in [9], which
exhibit many of the key features of social networks. The first one is a trackback network
of Japanese blogs. The network data was collected by tracing the trackbacks from one
blog in the site goo (footnote 2) in May 2005. We refer to this network data as the blog network.
The blog network was a strongly-connected bidirectional network, where a link created
by a trackback was regarded as a bidirectional link since blog authors establish mutual
communications by putting trackbacks on each other's blogs. The blog network had
12,047 nodes and 79,920 directed links. The second one is a network of people that was
derived from the "list of people" within Japanese Wikipedia. We refer to this network
data as the Wikipedia network. The Wikipedia network was also a strongly-connected
bidirectional network, and had 9,481 nodes and 245,044 directed links.

Here, we assumed the simplest case where $r_{u,v}$ and $\kappa_{u,v}$ are uniform throughout the
network $G$, that is, $r_{u,v} = r$, $\kappa_{u,v} = \kappa$ for any link $(u, v) \in E$. One reason behind this
assumption is that we can make a fair comparison with the existing heuristics that are
solely based on network structure (see 4.2). Another reason is that there is no need
to acquire observation sequence data that pass through every link at least once. This
drastically reduces the amount of data needed to learn the parameters. Then, our task is to
estimate the values of $r$ and $\kappa$. According to [7], we set the value of $\kappa$ relatively small.
In particular, we set the value of $\kappa$ to a value smaller than $1/\bar{d}$, where $\bar{d}$ is the mean
out-degree of a network. Since the values of $\bar{d}$ were about 6.63 and 25.85 for the blog
and the Wikipedia networks, respectively, the corresponding values of $1/\bar{d}$ were about
0.15 and 0.03. Thus, as for the true value of the diffusion parameter $\kappa$, we decided to
set $\kappa = 0.1$ for the blog network and $\kappa = 0.01$ for the Wikipedia network. As for the
true value of the time-delay parameter $r$, we decided to investigate two cases: one with
a relatively high value $r = 2$ (a short time-delay case) and the other with a relatively
low value $r = 0.5$ (a long time-delay case) in both networks. We used the training data
$\mathcal{D}_M$ in the learning stage, which is constructed by generating each $D_m$ from a randomly
selected initial active node $D_m(0)$ using the true CTIC model. $T_m$ was chosen to be
effectively $\infty$.
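Generating one training cascade $D_m$ can be sketched as follows. The sketch assumes the usual description of the CTIC process with uniform parameters: each newly active node gets one chance per child, the attempt succeeds with probability $\kappa$, and a successful attempt arrives after an exponentially distributed delay with rate $r$. It is not the authors' generator, and the function name, graph layout, and toy graph are hypothetical.

```python
import heapq
import random

def simulate_ctic(graph, seed_node, r, kappa, rng):
    """Return {node: activation_time} for one cascade started at seed_node."""
    times = {}
    queue = [(0.0, seed_node)]          # pending activation events, earliest first
    while queue:
        t, v = heapq.heappop(queue)
        if v in times:
            continue                    # v was already activated by an earlier event
        times[v] = t
        for w in graph.get(v, []):
            # Deciding success before drawing the delay gives the same distribution
            # as drawing the delay first and testing success on arrival.
            if w not in times and rng.random() < kappa:
                heapq.heappush(queue, (t + rng.expovariate(r), w))
    return times

# Toy graph, purely illustrative; the experiments use the blog and Wikipedia networks.
toy_graph = {0: [1, 2], 1: [2, 3], 2: [3], 3: [0]}
rng = random.Random(0)
cascade = simulate_ctic(toy_graph, seed_node=0, r=2.0, kappa=0.5, rng=rng)
```

Running the process until the event queue is empty corresponds to choosing $T_m$ effectively infinite, as in the setting above.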

We note that the influence degree $\sigma(v)$ of a node $v$ is invariant with respect to the
values of the delay parameter $r$. In fact, the effect of $r$ is to delay the timings when nodes
become active, that is, parameter $r_{u,v}$ only controls how soon or late node $v$ actually
becomes active when node $u$ activates node $v$. Therefore, nodes that can be activated
are indeed activated eventually after a sufficiently long time has elapsed, which is
the case here, i.e., $T_m = \infty$. Thus, we can evaluate the $\sigma(v)$ of the CTIC model by
the influence degree of $v$ for the corresponding IC model. We estimated the influence
degrees $\{\sigma(v);\ v \in V\}$ using the method of [8] with the parameter value 10,000, where
the parameter represents the number of bond percolation processes (we do not describe
the method here due to the page limit). The average value and the standard deviation of
the influence degrees were 87.5 and 131 for the blog network, and 8.14 and 18.4 for the
Wikipedia network.
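Because $\sigma(v)$ depends only on $\kappa$, it can be approximated by repeated simulation of the IC model. The sketch below is a naive Monte Carlo estimator in which every link is tried at most once per trial and succeeds with probability $\kappa$. It is deliberately simpler than the bond percolation method of [8], which is not reproduced here, and the function name and toy graph are assumptions.

```python
import random
from collections import deque

def influence_degree(graph, seed_node, kappa, trials, rng):
    """Average number of nodes that end up active when the cascade starts at seed_node."""
    total = 0
    for _ in range(trials):
        active = {seed_node}
        frontier = deque([seed_node])
        while frontier:
            u = frontier.popleft()
            for w in graph.get(u, []):
                # Each link (u, w) is tried once, when u is first processed.
                if w not in active and rng.random() < kappa:
                    active.add(w)
                    frontier.append(w)
        total += len(active)
    return total / trials

# Toy chain 0 -> 1 -> 2, for which sigma(0) = 1 + kappa + kappa**2.
chain = {0: [1], 1: [2], 2: []}
sigma_0 = influence_degree(chain, 0, kappa=0.5, trials=20_000, rng=random.Random(0))
```

On the toy chain the estimate can be compared with the exact value $1 + \kappa + \kappa^2 = 1.75$.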


2  http://blog.goo.ne.jp/ 
Table 1: Learning performance by the proposed method.

(r = 2)

           Blog network          Wikipedia network
    M      δ_r       δ_κ         δ_r       δ_κ
   20      0.013     0.015       0.036     0.034
   40      0.010     0.010       0.024     0.016
   60      0.008     0.008       0.013     0.015
   80      0.007     0.007       0.012     0.013
  100      0.005     0.005       0.006     0.011

(r = 0.5)

           Blog network          Wikipedia network
    M      δ_r       δ_κ         δ_r       δ_κ
   20      0.011     0.012       0.026     0.028
   40      0.010     0.007       0.021     0.023
   60      0.009     0.005       0.018     0.021
   80      0.004     0.004       0.014     0.012
  100      0.004     0.004       0.007     0.006

4.2  Comparison  Methods 

We compared the predicted result of the high ranked influential nodes for the true CTIC
model by the proposed method with four heuristics widely used in social network analysis.

The first three of these heuristics are "degree centrality", "closeness centrality", and
"betweenness centrality". These are commonly used as influence measures in sociology
[11], where the out-degree of node $v$ is defined as the number of links going out from
$v$, the closeness of node $v$ is defined as the reciprocal of the average distance between
$v$ and other nodes in the network, and the betweenness of node $v$ is defined as the
total number of shortest paths between pairs of nodes that pass through $v$. The fourth
is "authoritativeness" obtained by the "PageRank" method [12]. We considered this
measure since this is a well known method for identifying authoritative or influential
pages in a hyperlink network of web pages. This method has a parameter $\varepsilon$; when we
view it as a model of a random web surfer, $\varepsilon$ corresponds to the probability with which
a surfer jumps to a page picked uniformly at random [13]. In our experiments, we used
a typical setting of $\varepsilon = 0.15$.
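The random-surfer reading of $\varepsilon$ can be made concrete with a short power-iteration sketch. This is a generic textbook formulation rather than the exact procedure of [12] or [13]; the function name, the fixed iteration count, and the uniform treatment of nodes without out-links are assumptions.

```python
def pagerank(graph, eps=0.15, iters=100):
    """Power iteration: with probability eps the surfer jumps to a uniformly random node."""
    nodes = sorted(set(graph) | {w for ws in graph.values() for w in ws})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: eps / n for v in nodes}
        for u in nodes:
            out = graph.get(u, [])
            if out:
                share = (1.0 - eps) * rank[u] / len(out)
                for w in out:
                    nxt[w] += share
            else:
                # Assumption: a node without out-links spreads its rank uniformly.
                for w in nodes:
                    nxt[w] += (1.0 - eps) * rank[u] / n
        rank = nxt
    return rank

# Toy star-like graph: nodes 1, 2, 3 all link to node 0, which links back to node 1.
star = {1: [0], 2: [0], 3: [0], 0: [1]}
scores = pagerank(star)
out_degree = {v: len(ws) for v, ws in star.items()}
```

On the toy graph node 0 receives the highest score although every node has out-degree one, which shows how the two heuristics can rank nodes differently.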


4.3  Experimental  Results 


First, we examined the parameter estimation accuracy of the proposed method. Let $r_0$
and $\kappa_0$ be the true values of the parameters $r$ and $\kappa$, respectively, and let $\hat{r}$ and $\hat{\kappa}$ be
the values of $r$ and $\kappa$ estimated by the proposed method, respectively. We evaluated the
learning performance in terms of the error rates

$$\delta_r = \frac{|r_0 - \hat{r}|}{r_0}, \qquad \delta_\kappa = \frac{|\kappa_0 - \hat{\kappa}|}{\kappa_0}.$$


Table 1 shows the average values of $\delta_r$ and $\delta_\kappa$ for different numbers of training samples,
$M$. For each $M$ we repeated the same experiment 5 times independently, and for each
experiment we tried 5 different initial values of the parameters, drawn uniformly at
random from [0, 1]. The convergence criterion is

$$|\kappa^{(n)} - \kappa^{(n+1)}| + |r^{(n)} - r^{(n+1)}| < 10^{-12},$$

where the superscript $(n)$ indicates the value for the $n$th iteration. Our algorithm
converged at around 40 iterations for the blog data and 70 iterations for the Wikipedia data.
Further, it is observed, as predicted by the simple case analysis in 3.4, that the
convergence was faster for a smaller value of $r$. The converged values are close to the true
values when there is a reasonable amount of training data. The results demonstrate the
effectiveness of the proposed method.

[Fig. 1: Performance comparison in extracting influential nodes. Panels: (a) blog network (r = 2), (b) Wikipedia network (r = 2), (c) blog network (r = 0.5), (d) Wikipedia network (r = 0.5).]

Next, we compared the proposed method with the out-degree, the betweenness, the
closeness, and the PageRank methods in terms of the capability of ranking the influential
nodes. For any positive integer $k$ $(\le |V|)$, let $L_0(k)$ be the true set of top $k$ nodes,
and let $L(k)$ be the set of top $k$ nodes for a given ranking method. We evaluated the
performance of the ranking method by the ranking similarity $F(k)$ at rank $k$, where $F(k)$
is defined by

$$F(k) = \frac{|L_0(k) \cap L(k)|}{k}.$$
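The ranking similarity is straightforward to compute from two score tables. The helper below is a minimal sketch; the function name, the dictionary layout, and the toy scores are assumptions, and ties are broken by the sort order rather than by any rule stated in the text.

```python
def ranking_similarity(true_scores, method_scores, k):
    """F(k): fraction of the true top-k nodes that the method also places in its top k."""
    top_true = set(sorted(true_scores, key=true_scores.get, reverse=True)[:k])
    top_method = set(sorted(method_scores, key=method_scores.get, reverse=True)[:k])
    return len(top_true & top_method) / k

# Toy scores, purely illustrative.
true_sigma = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 2.0}
heuristic = {"a": 1.0, "b": 9.0, "c": 8.0, "d": 0.0}
f_at_2 = ranking_similarity(true_sigma, heuristic, k=2)
```

Here the true top two nodes are a and b while the heuristic selects b and c, so only one of the two agrees.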

We focused on ranking similarities only at high ranks since we are interested in
extracting influential nodes. Figures 1a and 1c show the results for the blog network, and
Figures 1b and 1d show the results for the Wikipedia network, where the true value of
$r$ is $r = 2$ for Figures 1a and 1b and $r = 0.5$ for Figures 1c and 1d. In
these figures, circles, triangles, diamonds, squares, and asterisks indicate ranking
similarity $F(k)$ as a function of rank $k$ for the proposed, the out-degree, the betweenness,
the closeness, and the PageRank methods, respectively. For the proposed method, we
plotted the average value of $F(k)$ at $k$ over the 5 experimental results stated earlier in the
case of $M = 100$. The proposed method gives far better results than the other heuristic
based methods for both networks, demonstrating the effectiveness of our proposed
learning method.

5  Behavioral  Analysis  of  Real  World  Blog  Data 

We applied our method to behavioral analysis of real-world blog data, based on
the method described in 3.3, and investigated how each topic spreads throughout the
network.


5.1  Experimental  Settings 


The network we used is a real blogroll network in which bloggers are connected to each
other. We note that when there is a blogroll link from blogger $y$ to another blogger $x$,
this means that $y$ is a reader of the blog of $x$. Thus, we can assume that topics propagate
from blogger $x$ to blogger $y$. According to [14], we suppose that a topic is represented
as a URL which can be tracked down from blog to blog. We used the database of
a blog-hosting service in Japan called Doblog (footnote 3). The database is constructed from all
the Doblog data from October 2003 to June 2005, and contains 52,525 bloggers and
115,552 blogroll links.

We identified all the URLs mentioned in blog posts in the Doblog database, and
constructed the following list for each URL from all the blog posts that contain the
URL:

$$\langle (v_1, t_1), \ldots, (v_k, t_k) \rangle, \quad (t_1 < \cdots < t_k),$$

where $v_j$ is a blogger who mentioned the URL in her/his blog post published at time
$t_j$. By taking into account the blogroll relations for the list, we estimated the paths
along which the URL might have propagated through the blogroll network. We extracted 7,356 URL
propagation paths from the Doblog dataset, where we ignored the URLs that only one
blogger mentioned. Out of these, only those that are longer than 10 time steps were chosen
for analysis, resulting in 172 sequences. Each sequence represents a topic, and
a topic can be distributed over multiple URLs. The same URL can appear in different
sequences. Note here that the time stamps of the blog articles all differ from each other,
and thus the time intervals in the sequence $\langle t_1, t_2, \ldots, t_k \rangle$ are not a fixed constant.
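The construction of the per-URL lists can be sketched as follows. The record layout `(blogger, timestamp, urls)`, the function name, and the example URLs are hypothetical, and keeping only each blogger's earliest mention of a URL is an assumption, since the text does not say how repeated mentions were handled. The blogroll-based path estimation is not covered by this sketch.

```python
from collections import defaultdict

def build_url_sequences(posts, min_bloggers=2):
    """Map each URL to its time-ordered list [(blogger, time), ...].

    posts: iterable of (blogger, timestamp, urls) records.
    URLs mentioned by fewer than min_bloggers distinct bloggers are dropped.
    """
    first_mention = defaultdict(dict)
    for blogger, timestamp, urls in posts:
        for url in urls:
            previous = first_mention[url].get(blogger)
            if previous is None or timestamp < previous:
                first_mention[url][blogger] = timestamp
    sequences = {}
    for url, mentions in first_mention.items():
        if len(mentions) >= min_bloggers:
            sequences[url] = sorted(mentions.items(), key=lambda item: item[1])
    return sequences

# Hypothetical posts; timestamps are in arbitrary units.
posts = [
    ("x", 3.0, ["http://example.com/a"]),
    ("y", 1.0, ["http://example.com/a", "http://example.com/b"]),
    ("x", 5.0, ["http://example.com/a"]),
    ("z", 2.0, ["http://example.com/a"]),
]
sequences = build_url_sequences(posts)
```

In this toy input the second URL is mentioned by a single blogger and is therefore ignored, mirroring the filtering rule described above.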


5.2  Experimental  Results 

We ran the experiments for each identified URL and obtained the corresponding parameters
$\kappa$ and $r$. Figure 2 is a plot of the results for the major URLs. The horizontal axis
is the diffusion parameter $\kappa$ and the vertical axis is the delay parameter $r$. The latter is
normalized such that $r = 1$ corresponds to a delay of one day, meaning $r = 0.1$
corresponds to a delay of 10 days. We only explain three URLs that exhibit some interesting
propagation properties. The circle is a URL that corresponds to the musical baton, which
is a kind of telephone game on the Internet. It has the following rules. First, a blogger
is requested to respond to five questions about music by some other blogger (receive
the baton), and the requested blogger replies to the questions and designates the next five
bloggers with the same questions (pass the baton). It is shown that this kind of message
propagates quickly (less than one day on the average) with a good chance (one out of
25 to 100 persons responds). This is probably because people are interested in this kind
of message passing. The square is a URL that corresponds to articles about a missing
child. This also propagates quickly with a meaningful probability (one out of 80 persons
responds). This is understandable considering the urgency of the message. The cross is
a URL that corresponds to articles about fortune telling. People's responses are diverse.
Some respond quickly (less than one day) and some late (more than one month after),
and they are more or less uniformly distributed. The diffusion probability is also
nearly uniformly distributed. This reflects that each individual's interest is different on
this topic. The dot is a URL that corresponds to one of the other topics. Interestingly,
the one in the bottom right, which is isolated from the rest, is a post of an invitation to a
rock music festival. This one has a very large probability of being propagated but with a
very large time delay. In general, it can be said that the proposed method can extract
characteristic properties of certain topics reasonably well from the observation
data alone.

[Fig. 2: Results for the Doblog database.]

3 Doblog (http://www.doblog.com/), provided by NTT Data Corp. and Hotto Link, Inc.
6  Discussion


Being able to handle time more precisely is a merit in the analysis of information
diffusion in data such as blog data, because time stamps are available in units of
seconds. With a discrete-time model, there are subtle cases where it is not self-evident
which time step an observation should be assigned to when discretization has to be made.
Our continuous-time formulation resolves this problem.

There are many pieces of work in which time sequence data is analyzed assuming
a certain underlying model. Ours also falls in this category. The proposed approach brings
in a new perspective in that it allows us to use the structure of a complex network as a
kind of background knowledge in a more refined way. There are also many pieces of
work on topic propagation analyses, but they focus mostly on the analyses of average
propagation speed (propagation speed distribution) and average life time. Our method
is new and different in that we explicitly address the diffusion phenomena incorporating
diffusion probability and time delay as well as the structure of the network.

The proposed method derives the learning algorithm in a principled way. The objective
function has a clear meaning as the likelihood of obtaining the observed
data, and the parameters are iteratively updated in such a way as to maximize the likelihood,
guaranteeing the convergence. Due to the property of continuous time, we excluded the
possibility that a node is activated simultaneously by multiple parent nodes. It is also
straightforward to formulate the likelihood taking the possibility of simultaneous
activation into account. However, the numerical experiments revealed that the results
are not as accurate as those of the current model. Having to explore millions of paths with very
small probability harms numerical computation. This is, in a sense, similar to the
problem of feature selection in building a classifier. It is known that the existence of
irrelevant features is harmful even though the classification algorithm can in theory
ignore those irrelevant features.

The CTIC model is a continuous-time information diffusion model that extends the
discrete-time model by Gruhl et al. [15]. We note that their model is based on the popular
IC model and that they model the time-delay by a geometric distribution. In the CTIC model,
we model a time-delay by an exponential distribution. Song et al. [16] also modeled
time-delays of information flow by exponential distributions in formulating an information
flow model by a continuous-time Markov chain (i.e., a random-surfer model).
Thus, we can regard the CTIC model as a natural extension of the IC model to a
continuous-time information diffusion model, and investigating its characteristics can
be an important research issue. As explained in Section 2.2, the CTIC model is rather
complicated, and developing a learning algorithm for the CTIC model is challenging. In
this paper, we presented an effective method for estimating the parameters of the CTIC
model from observed data, and applied it to node-ranking and social behavioral data
analysis. To the best of our knowledge, we are the first to formulate a continuous-time
information diffusion model based on the IC model and a rigorous learning algorithm to
estimate the model parameters from observation. We are not claiming that the model is the
most accurate. The time-delay distribution for real information diffusion must be more
complex, and a power-law distribution and the like might be more suitable. Our future
work includes incorporating various more realistic distributions as the time-delay
distribution.
The learning algorithm we proposed is a one-time batch process. In reality,
observation data keep arriving and the environment may change over time. It is
not straightforward to convert the algorithm to an incremental mode. The simplest way to
cope with this is to use a fixed time window to collect data and use the parameters from
the previous window as the initial guesses.

We consider that our proposed ranking method presents a novel concept of centrality
based on the information diffusion model, i.e., the CTIC model. Actually, nodes
identified as higher ranked by our method are substantially different from those identified by each
of the conventional methods. This means that our method enables a new type of social
network analysis if past information diffusion data are available. Note that we do not
claim that the proposed method should replace the conventional ones; rather, it is an
addition to them which has a different merit in terms of information diffusion.

We note that the analysis we showed in this paper is the simplest case where $\kappa$ and $r$
take a single value each for all the links in $E$. However, the method is very general. In a
more realistic setting we can divide $E$ into subsets $E_1, E_2, \ldots, E_N$ and assign a different
value $\kappa_n$ and $r_n$ for all the links in each $E_n$. For example, we may divide the nodes
into two groups: those that strongly influence others and those that do not, or we may divide
the nodes into another two groups: those that are easily influenced by others and those
that are not. We can further divide the nodes into multiple groups. If there is some background
knowledge about the node grouping, our method can make the best use of it.


7  Conclusion 

We emphasized the importance of incorporating continuous time delay for the behavioral
analysis of information diffusion through a social network, and addressed the
problem of estimating the parameters for a continuous time delay independent cascade
(CTIC) model from the observed data by rigorously formulating the likelihood
of obtaining these data and maximizing the likelihood iteratively with respect to the
parameters (time-delay and diffusion). We tested the convergence performance of the
proposed method by applying it to the problem of ranking influential nodes using the
network structure from two real-world web datasets. We showed that the parameters
converge to the correct values efficiently by the iterative procedure, and that the learned
model can predict the high ranked influential nodes much more accurately than the four
well-studied heuristic methods. We further applied the method to the problem of behavioral
analysis of topic propagation using real-world blog data, and showed that there are indeed
sensible differences in the propagation patterns in terms of delay and diffusion among
different topics.
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Extracting  Influential  Nodes  on  a  Social  Network 
for  Information  Diffusion 


Masahiro  Kimura,  Kazumi  Saito,  Ryohei  Nakano, 
and  Hiroshi  Motoda 


Abstract 

We address the combinatorial optimization problem of finding the most
influential nodes on a large-scale social network for two widely-used
fundamental stochastic diffusion models. A previous study showed that a greedy
strategy can give a good approximate solution to the problem. However, the
conventional greedy method is computationally expensive. We propose a
method of efficiently finding a good approximate solution to the problem
under the greedy algorithm on the basis of bond percolation and graph theory,
and compare the proposed method with the conventional method in terms of
computational complexity in order to theoretically evaluate its effectiveness.
The results show that the proposed method is expected to achieve a large
reduction in computational cost. We further experimentally demonstrate that
the proposed method is much more efficient than the conventional method
using large-scale real-world networks including blog networks.


Keywords 

Social network analysis, Information diffusion model, Influence
maximization problem, Bond percolation



Authors’  Addresses: 



Masahiro  Kimura 

Department  of  Electronics  and  Informatics 
Ryukoku  University 
Otsu  520-2194,  Japan 
kimura@rins.ryukoku.ac.jp 


Kazumi Saito

School  of  Administration  and  Informatics 
University  of  Shizuoka 
Shizuoka  422-8526,  Japan 
k-saito@u-shizuoka-ken.ac.jp 


Ryohei  Nakano 

Department  of  Computer  Science 
Chubu  University 
Aichi  487-8501,  Japan 
nakano@cs.chubu.ac.jp 


Hiroshi  Motoda 

Institute  of  Scientific  and  Industrial  Research 
Osaka  University 
Osaka  567-0047,  Japan 
motoda@ar.sanken.osaka-u.ac.jp 



1  Introduction 



The rise of the Internet and the World Wide Web has enabled us to
investigate large-scale social networks, and there has been growing interest in
social network analysis (Newman, 2001; McCallum et al., 2005; Leskovec et
al., 2006). Here, a social network is the network of relationships and
interactions among social entities such as individuals, groups of individuals, and
organizations. Examples include blog networks, collaboration networks, and
email networks.

The social network of interactions within a group of individuals plays a
fundamental role in the spread of information, ideas, and innovations. In
fact, a piece of information, such as the URL of a website that provides a
new valuable service, can spread from one individual to another through the
social network in the form of “word-of-mouth” communication. For example,
information about free email services such as Microsoft’s Hotmail and
Google’s Gmail could spread largely through email networks. Thus, when
we plan to market a new product, promote an innovation, or spread a new
topic among a group of individuals, we can exploit social network effects.
Namely, we can target a small number of influential individuals (e.g., giving
free samples of the product, demonstrating the innovation, or offering the
topic), and trigger a cascade of influence by which friends will recommend
the product, promote the innovation, or propagate the topic to other friends.
In this way, we can spread decisions to adopt the product, the innovation,
or the topic through the social network from a small set of initial adopters
to many individuals. Therefore, given a social network represented by a
directed graph, a positive integer $k$, and a probabilistic model for the process
by which certain information spreads through the network, it is an important
research issue in terms of sociology and viral marketing to find a
target set $A_k^*$ of $k$ nodes that maximizes the expected number of adopters of
the information if $A_k^*$ initially adopts it (Domingos and Richardson, 2001;
Richardson and Domingos, 2002; Kempe et al., 2003; Kempe et al., 2005).
Here, the expected number of nodes influenced by a target set is referred to
as its influence degree, and this combinatorial optimization problem is called
the influence maximization problem of size $k$.

Kempe et al. (2003) studied the influence maximization problem for two
widely-used fundamental information diffusion models, the independent
cascade (IC) model (Goldenberg, 2001; Kempe et al., 2003; Gruhl et al., 2004)
and the linear threshold (LT) model (Watts, 2002; Kempe et al., 2003). They
experimentally showed on large collaboration networks that for the influence
maximization problem under the IC and LT models, the greedy algorithm
significantly outperforms the high-degree and centrality heuristics that are
commonly used in the sociology literature. Here, the high-degree heuristic
chooses nodes in order of decreasing degree, and the centrality heuristic
chooses nodes in order of increasing average distance to other nodes in the
network. Moreover, they mathematically proved a performance guarantee
for the greedy algorithm under these information diffusion models (i.e., the
IC and LT models) by using an analysis framework based on submodular
functions.

For the influence maximization problem of size $k$, the greedy algorithm
iteratively finds a target set $A_k$ of $k$ nodes from the target set $A_{k-1}$ of $k - 1$
nodes that it has already found. Thus, it requires a method of computing
all the marginal influence degrees of a given set $A$ of nodes in the network.
Here, for any node $v$ that does not belong to $A$, the influence degree of
target set $A \cup \{v\}$ is referred to as the marginal influence degree of $A$ at
$v$. However, computing influence degrees exactly by an efficient method is
an open question, and therefore, the conventional method had to obtain
good estimates for influence degrees by simulating the random process of the
information diffusion model (i.e., the IC or LT model) many times (Kempe
et al., 2003). As a result, solving the influence maximization problem under
the greedy algorithm needed a large amount of computation for large-scale networks.

In this paper, for the IC and LT models, we propose a method of
efficiently estimating all the marginal influence degrees of a given set of nodes
on the basis of bond percolation and graph theory, and apply it to
approximately solving the influence maximization problem under the greedy
algorithm. In order to theoretically evaluate the effectiveness of the
proposed method for solving the influence maximization problem, we compare
the proposed method with the conventional method in terms of
computational complexity, and show that the proposed method is expected to
achieve a large reduction in computational cost. Further, using large-scale
real networks including blog networks, we experimentally demonstrate that
the proposed method is much more efficient than the conventional method.
Finally, we discuss some related work, and describe the conclusion.

2  Definitions 

We  examine  the  influence  maximization  problem  on  a  network  represented 
by  a  directed  graph  G  =  (V,  E)  for  the  IC  and  LT  models.  Here,  V  and  E 
are  the  sets  of  all  the  nodes  and  links  in  the  network,  respectively.  Let  N 
and  L  be  the  numbers  of  elements  of  V  and  E,  respectively. 

We first recall some basic notions from graph theory. Next, we define the
IC and LT models on G according to the work of Kempe et al. (2003). Last,
we give a mathematical definition of the influence maximization problem.

2.1  Graphs 

We consider a directed graph $G = (V, E)$. If there is a directed link $(u, v)$
from node $u$ to node $v$, node $v$ is called a child node of node $u$ and node $u$ is
called a parent node of node $v$. For any $v \in V$, let $\Gamma(v)$ denote the set of all
the parent nodes of $v$. For a subset $V'$ of $V$, graph $G' = (V', E')$ is called
the induced graph of $G$ to $V'$ if $E' = E \cap (V' \times V')$.

We call $(u_0, \ldots, u_\ell)$ a path from node $u_0$ to node $u_\ell$ if we have
$(u_{i-1}, u_i) \in E$, $(i = 1, \ldots, \ell)$. We say that node $u$ can reach node $v$ or node $v$ is
reachable from node $u$ if there is a path from node $u$ to node $v$. For a node
$v$ of the graph $G$, we define $F(v; G)$ to be the set of all the nodes that are
reachable from $v$, and define $B(v; G)$ to be the set of all the nodes that can
reach $v$. For any $A \subset V$, we set

$$F(A; G) = \bigcup_{v \in A} F(v; G), \qquad B(A; G) = \bigcup_{v \in A} B(v; G).$$

A strongly connected component (SCC) of $G$ is a maximal subset $C$ of $V$
such that for all $u, v \in C$ there is a path from $u$ to $v$. For a node $v$ of $G$,
we define $\mathrm{SCC}(v; G)$ to be the SCC that contains $v$.
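
The reachability sets and the SCC defined above can be computed by plain breadth-first search. The following is a minimal sketch, not the authors' implementation: the adjacency-dictionary representation, the function names `reachable`, `reverse`, and `scc_of`, and the toy graph are all assumptions made for illustration. It also assumes, as the worked example in Section 4.2 suggests, that every node counts as reachable from itself.

```python
from collections import deque


def reachable(adj, start):
    """Return F(start; G): all nodes reachable from `start`, itself included."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def reverse(adj):
    """Return the graph with every link (u, v) flipped to (v, u)."""
    rev = {u: [] for u in adj}
    for u, children in adj.items():
        for w in children:
            rev.setdefault(w, []).append(u)
    return rev


def scc_of(adj, v):
    """Return SCC(v; G) as the intersection of F(v; G) and B(v; G)."""
    forward = reachable(adj, v)
    backward = reachable(reverse(adj), v)  # B(v; G): nodes that can reach v
    return forward & backward


# Hypothetical toy graph: a -> b -> c -> a is a cycle, and c -> d leaves it.
toy = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
print(sorted(reachable(toy, "a")))
print(sorted(scc_of(toy, "a")))
```

Intersecting the forward and backward sets is the same identity that Step (E8) of the proposed algorithm relies on.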

2.2  Information  Diffusion  Models 

We mathematically model the spread of certain information
through a social network G = (V, E). In the IC and LT models, the
following assumptions are made:

•  A  node  is  called  active  if  it  has  adopted  the  information. 

•  The  state  of  a  node  is  either  active  or  inactive. 

•  Nodes  can  switch  from  being  inactive  to  being  active,  but  cannot 
switch  from  being  active  to  being  inactive. 

•  The  spread  of  the  information  through  the  network  G  is  represented 
as  the  spread  of  active  nodes  on  G. 

•  Given  an  initial  set  A  of  active  nodes,  we  suppose  that  the  nodes  in  A 
first  become  active  and  all  the  other  nodes  remain  inactive  at  time-step 
0. 

•  The diffusion process of active nodes unfolds in discrete time-steps
$t \ge 0$.

2.2.1  Independent  Cascade  Model 

First, we define the independent cascade (IC) model. In this model, we
specify a real value $p_{u,v} \in [0, 1]$ for each directed link $(u, v)$ in advance.
Here, $p_{u,v}$ is referred to as the propagation probability through link $(u, v)$.
When an initial set $A$ of active nodes is given, the diffusion process of active
nodes proceeds according to the following randomized rule. When node $u$
first becomes active at time-step $t$, it is given a single chance to activate
each of its currently inactive child nodes $v$, and succeeds with probability
$p_{u,v}$. If $u$ succeeds, then $v$ will become active at time-step $t + 1$. Here, if
$v$ has multiple parent nodes that become active at time-step $t$ for the first
time, then their activation attempts are sequenced in an arbitrary order, but
performed at time-step $t$. Whether or not $u$ succeeds, it cannot make any
further attempts to activate $v$ in subsequent rounds. The process terminates
if no more activations are possible.

For an initial active set $A$ $(\subset V)$, let $\varphi(A)$ denote the number of active
nodes at the end of the random process for the IC model. Note that $\varphi(A)$
is a random variable. Let $\sigma(A)$ denote the expected value of $\varphi(A)$. We call
$\sigma(A)$ the influence degree of $A$.
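
One run of the IC process can be sketched as a frontier-by-frontier simulation. This is an illustrative sketch under stated assumptions, not code from the paper: the function name `simulate_ic`, the dictionary layout of `children` and `prob`, the toy graph, and its probabilities are all hypothetical. Averaging the returned counts over many runs gives a Monte Carlo estimate of the influence degree.

```python
import random


def simulate_ic(children, prob, seeds, rng):
    """Run one IC diffusion from `seeds`; return the final number of active nodes."""
    active = set(seeds)
    frontier = list(active)          # nodes that became active at the current step
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in children.get(u, ()):
                # u gets exactly one attempt on each currently inactive child
                if v not in active and rng.random() < prob[(u, v)]:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)


# Hypothetical three-node graph with made-up propagation probabilities.
children = {"a": ["b", "c"], "b": ["c"], "c": []}
prob = {("a", "b"): 0.5, ("a", "c"): 0.1, ("b", "c"): 0.5}
rng = random.Random(0)
runs = [simulate_ic(children, prob, {"a"}, rng) for _ in range(10000)]
print(sum(runs) / len(runs))  # Monte Carlo estimate of the influence degree of {a}
```

For this toy graph the influence degree of {a} can be worked out by hand as 1 + 0.5 + (0.1 + 0.9 x 0.5 x 0.5) = 1.825, which the printed estimate should approach.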

2.2.2  Linear  Threshold  Model 

Next, we define the linear threshold (LT) model. In this model, for any node
$v \in V$, we specify in advance a weight $w_{u,v}$ $(> 0)$ from its parent node $u$
such that

$$\sum_{u \in \Gamma(v)} w_{u,v} \le 1,$$

where $\Gamma(v)$ is the set of all the parent nodes of $v$.
When an initial set $A$ of active nodes is given, the diffusion process of active
nodes proceeds according to the following randomized rule. First, for any
node $v \in V$, a threshold $\theta_v$ is chosen uniformly at random from the interval
$[0, 1]$. At time-step $t$, an inactive node $v$ is influenced by each of its active
parent nodes $u$ according to weight $w_{u,v}$. If the total weight from active
parent nodes of $v$ is at least threshold $\theta_v$, that is,

$$\sum_{u \in \Gamma_t(v)} w_{u,v} \ge \theta_v,$$

then $v$ will become active at time-step $t + 1$. Here, $\Gamma_t(v)$ stands for the set
of parent nodes of $v$ that are active at time-step $t$. The process terminates
if no more activations are possible.
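
The LT rule can be sketched in the same style. The code below is a minimal illustration, assuming a hypothetical `parents` dictionary that lists every node as a key and a `weight` dictionary keyed by link; the name `simulate_lt` and the two-node example are not from the paper. The extra `total > 0` check is a guard added here for the measure-zero case in which a drawn threshold equals exactly zero.

```python
import random


def simulate_lt(parents, weight, seeds, rng):
    """Run one LT diffusion from `seeds`; return the final number of active nodes."""
    nodes = set(parents)
    theta = {v: rng.random() for v in sorted(nodes)}   # thresholds, uniform on [0, 1)
    active = set(seeds)
    while True:
        newly = set()
        for v in nodes - active:
            total = sum(weight[(u, v)] for u in parents[v] if u in active)
            # total > 0 guards the measure-zero case theta[v] == 0.0
            if total > 0 and total >= theta[v]:
                newly.add(v)
        if not newly:
            return len(active)
        active |= newly


# Hypothetical two-node graph: b adopts with probability w_{a,b} = 0.5.
parents = {"a": [], "b": ["a"]}
weight = {("a", "b"): 0.5}
rng = random.Random(0)
runs = [simulate_lt(parents, weight, {"a"}, rng) for _ in range(10000)]
print(sum(runs) / len(runs))  # should be close to 1.5
```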

Note that the threshold $\theta_v$ models the tendency of node $v$ to adopt
the information when its parent nodes do. Note also that the LT model
is a probabilistic model associated with the uniform distribution on $[0, 1]^N$.
Further note that in the LT model it is the node thresholds that are random,
while in the IC model it is the propagations through links that are random.
Suppose that $A$ is an initial set of active nodes. We define a random variable
$\varphi(A)$ as the number of active nodes at the end of the random process for
the LT model. Let $\sigma(A)$ denote the expected value of $\varphi(A)$. We call $\sigma(A)$
the influence degree of $A$. Note that these notations are the same as those
for the IC model.

2.3  Influence  Maximization  Problem

We mathematically define the influence maximization problem on a network
$G = (V, E)$ under the IC and LT models. Let $k$ be a positive integer with
$k < N$.

The influence maximization problem on $G$ of size $k$ is defined as follows:
Find a set $A_k^*$ of $k$ nodes to target for initial activation such that $\sigma(A_k^*) \ge \sigma(S)$
for any set $S$ of $k$ nodes, that is, find

$$A_k^* = \operatorname{argmax}_{A \in \{S \subset V;\ |S| = k\}} \sigma(A), \tag{1}$$

where $|S|$ stands for the number of elements of set $S$.

3  Conventional  Method 

Kempe et al. (2003) showed the effectiveness of the greedy algorithm for
the influence maximization problem under the IC and LT models. In this
section, we introduce the greedy algorithm, and describe the conventional
method for solving the influence maximization problem under the greedy
algorithm. We then evaluate the computational complexity of
the conventional method.

3.1  Greedy  Algorithm 

We approximately solve the influence maximization problem by the following
greedy algorithm:

(G1)  Set $A \leftarrow \emptyset$.

(G2)  for $i = 1$ to $k$ do

(G3)  Choose a node $v_i \in V$ maximizing $\sigma(A \cup \{v\})$, $(v \in V \setminus A)$.

(G4)  Set $A \leftarrow A \cup \{v_i\}$.

(G5)  end for

Let $A_k$ denote the set of $k$ nodes obtained by this algorithm. We refer to
$A_k$ as the greedy solution of size $k$. Then, it is known that

$$\sigma(A_k) \ge \left(1 - \frac{1}{e}\right) \sigma(A_k^*),$$

that is, the quality guarantee of $A_k$ is assured (Kempe et al., 2003). Here,
$A_k^*$ is the exact solution defined by Equation (1).
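
Steps (G1) to (G5) can be sketched with the influence degree passed in as a callable. This is an illustrative sketch only: the name `greedy`, the `sigma` parameter, and the coverage function used as a stand-in for the influence degree are assumptions, chosen because set coverage is easy to check by hand and is also submodular.

```python
def greedy(nodes, k, sigma):
    """Steps (G1)-(G5): grow A one node at a time by the largest sigma(A | {v})."""
    A = []                                   # (G1), kept as a list to record order
    for _ in range(k):                       # (G2)
        current = set(A)
        candidates = [v for v in nodes if v not in current]
        best = max(candidates, key=lambda v: sigma(current | {v}))   # (G3)
        A.append(best)                       # (G4)
    return A


# Hypothetical stand-in for sigma: a coverage count, which is also submodular.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5, 6}}


def coverage(S):
    """Number of distinct items covered by the nodes in S."""
    return len(set().union(*(covers[v] for v in S)))


print(greedy(["a", "b", "c"], 2, coverage))
```

Here the first pick is `a` (three items), and the second is `c`, since adding `c` covers five items while adding `b` covers only four. In the setting of the paper, `sigma` would be replaced by an estimator of the influence degree.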

To implement the greedy algorithm, we need a method for estimating all
the marginal influence degrees $\{\sigma(A \cup \{v\});\ v \in V \setminus A\}$ of $A$ in Step (G3) of
the algorithm.

3.2  Conventional  Method  for  Estimating  Marginal  Influence  Degrees

For Step (G3) of the greedy algorithm, the conventional method estimated
all the marginal influence degrees $\{\sigma(A \cup \{v\});\ v \in V \setminus A\}$ of $A$ in the following
way (Kempe et al., 2003): First, a sufficiently large positive integer $M$ is
specified. For any $v \in V \setminus A$, the random process of the diffusion model
(IC or LT model) is run from the initial active set $A \cup \{v\}$, and the number
$\varphi(A \cup \{v\})$ of final active nodes is counted. Each $\sigma(A \cup \{v\})$ is estimated as
the empirical mean obtained from $M$ such simulations.

Namely, the conventional method independently estimated $\sigma(A \cup \{v\})$
for all $v \in V \setminus A$ as follows:

1.  for $m = 1$ to $M$ do

2.  Compute $\varphi(A \cup \{v\})$.

3.  Set $x_m \leftarrow \varphi(A \cup \{v\})$.

4.  end for

5.  Set $\sigma(A \cup \{v\}) \leftarrow (1/M) \sum_{m=1}^{M} x_m$.

Here, each $\varphi(A \cup \{v\})$ is computed as follows:

1.  Set $H_0 \leftarrow A \cup \{v\}$.

2.  Set $t \leftarrow 0$.

3.  while $H_t \ne \emptyset$ do

4.  Set $H_{t+1} \leftarrow$ {the activated nodes at time $t + 1$}.

5.  Set $t \leftarrow t + 1$.

6.  end while

7.  Set $\varphi(A \cup \{v\}) \leftarrow \sum_{j=0}^{t} |H_j|$.
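
The outer loop of the conventional method can be sketched as follows. The simulator is passed in as a callable so that the same loop serves the IC and LT models; the names `estimate_marginals` and `one_link_ic`, and the two-node graph with propagation probability 0.5, are hypothetical and chosen so that the exact answer is known.

```python
import random


def estimate_marginals(nodes, A, simulate, M, rng):
    """Estimate sigma(A | {v}) for every v outside A by M independent simulations."""
    A = set(A)
    estimates = {}
    for v in nodes:
        if v in A:
            continue
        total = 0
        for _ in range(M):                 # steps 1-4: M runs per candidate node
            total += simulate(A | {v}, rng)
        estimates[v] = total / M           # step 5: empirical mean
    return estimates


def one_link_ic(seeds, rng):
    """IC model on the hypothetical two-node graph a -> b with p_{a,b} = 0.5."""
    active = set(seeds)
    if "a" in active and "b" not in active and rng.random() < 0.5:
        active.add("b")
    return len(active)


estimates = estimate_marginals(["a", "b"], set(), one_link_ic, 20000, random.Random(1))
print(estimates)  # the exact values are sigma({a}) = 1.5 and sigma({b}) = 1
```

The loop makes the cost visible: every candidate node needs its own M simulations, which is the term the proposed method in Section 4 is designed to avoid.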

3.3  Computational  Complexity  of  Conventional  Method 

We evaluate the computational complexity of solving the
influence maximization problem. For this purpose, we introduce the notion of
examined nodes. Here, an examined node is a node that is actually
visited by tracing incoming or outgoing links on the graph in question for the
method when all the marginal influence degrees $\{\sigma(A \cup \{v\});\ v \in V \setminus A\}$ of
$A$ are estimated in Step (G3) of the greedy algorithm. In Section 4.4, we
describe the reason why we investigate the examined nodes for evaluating
the computational complexity.

The computational complexity of the conventional method is evaluated
in terms of the expected number of examined nodes. In order to estimate
$\sigma(A \cup \{v\})$, $(v \in V \setminus A)$, it is necessary for any $v \in V \setminus A$ to simulate
$M$ times the random process of the information diffusion model (IC or LT
model) from the initial active set $A \cup \{v\}$ on graph $G$. For each simulation,
the set of examined nodes is the same as the set of active nodes in the
process. Thus, we can estimate that the expected number $C_0$ of examined
nodes for the conventional method is

$$C_0 = M \sum_{v \in V \setminus A} \sigma(A \cup \{v\}). \tag{2}$$


4  Proposed  Method 

We propose a method for efficiently estimating all the marginal influence
degrees $\{\sigma(A \cup \{v\});\ v \in V \setminus A\}$ of $A$ in Step (G3) of the greedy algorithm
on the basis of bond percolation and graph theory. We then evaluate its
computational complexity and compare it with that of the conventional method.

4.1  Bond  Percolation 

The IC and LT models are identified with bond percolation models, which are
defined below, and all the marginal influence degrees $\{\sigma(A \cup \{v\});\ v \in V \setminus A\}$
of $A$ are efficiently estimated by exploiting graph-theoretic methods.

A bond percolation process on graph $G = (V, E)$ is the process in which
each link of $G$ is randomly designated either “occupied” or “unoccupied”
according to some probability distribution. Here, in terms of information
diffusion on a social network, occupied links represent the links through
which the information propagates, and unoccupied links represent the links
through which the information does not propagate. Let us consider the
following set of $L$-dimensional vectors,

$$R_G = \left\{ r = (r_{u,v})_{(u,v) \in E} \in \{0, 1\}^L \right\},$$

where $L$ is the number of links in $G$. A bond percolation process on $G$ is
determined by a probability distribution $q(r)$ on $R_G$. Namely, for a random
vector $r \in R_G$ drawn from $q(r)$, each link $(u, v) \in E$ is designated “occupied”
if $r_{u,v} = 1$, and it is designated “unoccupied” if $r_{u,v} = 0$. Let $E_r$ denote the
set of all the occupied links for $r \in R_G$, and let $G_r$ denote the graph $(V, E_r)$.
For each $r \in R_G$, we can consider the deterministic diffusion model $\mathcal{M}_r$ on
$G_r$ such that $F(A; G_r)$ becomes the final set of active nodes when $A$ is an
initial set of active nodes, where $F(A; G_r)$ is the set that is reachable from $A$
on $G_r$ (see Section 2.1). By associating the diffusion model $\mathcal{M}_r$ on $G_r$ with
a probability distribution $q(r)$ on $R_G$, we define a stochastic diffusion model
on $G$. We call this diffusion model the bond percolation model on $G$, and
refer to the probability distribution $q(r)$ on $R_G$ as the occupation probability
distribution of the bond percolation model.

We easily see that the IC model on $G$ can be identified with the so-called
susceptible/infective/recovered (SIR) model (Newman, 2003) for the spread
of a disease on $G$, where the nodes that become active at time $t$ in the IC
model correspond to the infective nodes at time $t$ in the SIR model. We
recall that in the SIR model, an individual occupies one of the three states,
“susceptible”, “infected” and “recovered”, where a susceptible individual
becomes infected with a certain probability when s/he encounters an
infected patient and subsequently recovers at a certain rate (see Newman,
2003; Watts and Dodds, 2007). It is known that the SIR model on a network
can be exactly mapped onto a bond percolation model on the same network
(Grassberger, 1983; Newman, 2002; Kempe et al., 2003; Newman, 2003).
Hence, we see that the IC model on $G$ is equivalent to a bond percolation
model on $G$, that is, these two models have the same probability distribution
for the final set of active nodes given a target set. Here, for the IC model on
$G$, the occupation probability distribution $q(r)$ of the corresponding bond
percolation model is given by

$$q(r) = \prod_{(u,v) \in E} \left\{ (p_{u,v})^{r_{u,v}} (1 - p_{u,v})^{1 - r_{u,v}} \right\}, \quad (r \in R_G),$$

that is, each link $(u, v)$ of $G$ is independently declared to be “occupied”
with probability $p_{u,v}$, where $p_{u,v}$ is the propagation probability through link
$(u, v)$ in the IC model.
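
Sampling from this product distribution amounts to one independent coin-flip per link. The sketch below assumes a hypothetical edge list and `prob` dictionary; `sample_ic_graph` and `q_ic` are illustrative names, not functions from the paper. `q_ic` evaluates the product formula above for a given set of occupied links.

```python
import random


def sample_ic_graph(edges, prob, rng):
    """Draw r ~ q(r) for the IC model and return E_r, the occupied links."""
    return [e for e in edges if rng.random() < prob[e]]


def q_ic(occupied, edges, prob):
    """Probability q(r) of the outcome whose occupied links are `occupied`."""
    occupied = set(occupied)
    q = 1.0
    for e in edges:
        q *= prob[e] if e in occupied else 1.0 - prob[e]
    return q


# Hypothetical two-link graph with made-up propagation probabilities.
edges = [("a", "b"), ("b", "c")]
prob = {("a", "b"): 0.5, ("b", "c"): 0.25}
print(sample_ic_graph(edges, prob, random.Random(0)))
print(q_ic([("a", "b")], edges, prob))  # 0.5 * (1 - 0.25)
```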

On the other hand, Kempe et al. (2003) proved that the LT model on $G$
is also equivalent to a bond percolation model on $G$, in order to derive the result
that the influence degree function $\sigma(A)$ is submodular in the LT model.
Here, for the LT model on $G$, the corresponding occupation probability
distribution $q(r)$ is generated by declaring “occupied” and “unoccupied”
links in the following way: For any $v \in V$, we pick at most one of the
incoming links to $v$ by selecting link $(u, v)$ with probability $w_{u,v}$ and selecting
no link with probability $1 - \sum_{u \in \Gamma(v)} w_{u,v}$. After this process, the picked
links are declared to be “occupied” and the other links are declared to be
“unoccupied”. Here, $w_{u,v}$ is the weight of link $(u, v)$ in the LT model.
Specifically, $q(r)$ is described as follows:

$$q(r) = \prod_{v \in V} \left\{ \prod_{u \in \Gamma(v)} (w_{u,v})^{r_{u,v}} \right\} \left( 1 - \sum_{u \in \Gamma(v)} w_{u,v} \right)^{1 - \sum_{u \in \Gamma(v)} r_{u,v}},$$

where if $\sum_{u \in \Gamma(v)} w_{u,v} < 1$, $\sum_{u \in \Gamma(v)} r_{u,v} \le 1$, and if $\sum_{u \in \Gamma(v)} w_{u,v} = 1$,
$\sum_{u \in \Gamma(v)} r_{u,v} = 1$.
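
The LT occupation distribution can be sampled with one draw per node, walking through the cumulative weights of its incoming links. This is a minimal sketch under assumptions: the `parents` and `weight` dictionaries, the name `sample_lt_graph`, and the example weights are hypothetical.

```python
import random


def sample_lt_graph(parents, weight, rng):
    """Draw r ~ q(r) for the LT model: at most one occupied incoming link per node."""
    occupied = []
    for v, us in parents.items():
        x = rng.random()              # one "roulette" spin per node v
        cumulative = 0.0
        for u in us:
            cumulative += weight[(u, v)]
            if x < cumulative:        # link (u, v) is picked with probability w_{u,v}
                occupied.append((u, v))
                break
        # falling through the loop picks no link: probability 1 - sum of weights
    return occupied


# Hypothetical graph: b has one parent with weight 1; c has two parents with weight 0.3 each.
parents = {"b": ["a"], "c": ["a", "b"]}
weight = {("a", "b"): 1.0, ("a", "c"): 0.3, ("b", "c"): 0.3}
rng = random.Random(0)
samples = [sample_lt_graph(parents, weight, rng) for _ in range(2000)]
no_link_to_c = sum(all(v != "c" for _, v in s) for s in samples) / len(samples)
print(no_link_to_c)  # should be close to 1 - 0.6 = 0.4
```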
4.2  Proposed  Method  for  Estimating  Marginal  Influence  Degrees

We present a method of estimating all the marginal influence degrees
$\{\sigma(A \cup \{v\});\ v \in V \setminus A\}$ of $A$ in Step (G3) of the greedy algorithm. As shown in
the preceding section, the IC and LT models on $G$ can be identified with
the bond percolation models on $G$. Therefore, we have

$$\sigma(A \cup \{v\}) = \sum_{r \in R_G} q(r)\, |F(A \cup \{v\}; G_r)|$$

for any $v \in V \setminus A$, where $q(r)$ is the corresponding occupation probability
distribution, and $F(A \cup \{v\}; G_r)$ stands for the set of all the nodes that are
reachable from $A \cup \{v\}$ on graph $G_r$ (see Section 2.1).

We estimate $\{\sigma(A \cup \{v\});\ v \in V \setminus A\}$ in the following way: First, we
specify a sufficiently large positive integer $M$. Next, we independently
generate a set $\{r_1, \ldots, r_M\}$ of $M$ sample vectors on $R_G$ from the probability
distribution $q(r)$; that is, we independently generate a set $\{G_{r_m};\ m = 1, \ldots, M\}$
of $M$ graphs. For any $v \in V \setminus A$, we approximate $\sigma(A \cup \{v\})$ by

$$\sigma(A \cup \{v\}) \simeq \frac{1}{M} \sum_{m=1}^{M} |F(A \cup \{v\}; G_{r_m})|. \tag{3}$$

Thus, we estimate $\{\sigma(A \cup \{v\});\ v \in V \setminus A\}$ on the basis of Equation (3) as
follows:

1.  for $m = 1$ to $M$ do

2.  Generate graph $G_{r_m}$.

3.  Compute $\{|F(A \cup \{v\}; G_{r_m})|;\ v \in V \setminus A\}$.

4.  Set $x_{v,m} \leftarrow |F(A \cup \{v\}; G_{r_m})|$ for all $v \in V \setminus A$.

5.  end for

6.  Set $\sigma(A \cup \{v\}) \leftarrow (1/M) \sum_{m=1}^{M} x_{v,m}$ for all $v \in V \setminus A$.

In particular, we evaluate $\{|F(A \cup \{v\}; G_r)|;\ v \in V \setminus A\}$ for an arbitrary
$r \in R_G$ by the following algorithm:

(E1)  Find the subset $F(A; G_r)$ of $V$.

(E2)  Set $|F(A \cup \{v\}; G_r)| \leftarrow |F(A; G_r)|$ for all $v \in F(A; G_r) \setminus A$.

(E3)  Find the subset $V_r^A = V \setminus F(A; G_r)$ of $V$, and the induced graph $G_r^A$
of $G_r$ to $V_r^A$.

(E4)  Set $U \leftarrow \emptyset$.

(E5)  while $V_r^A \setminus U \ne \emptyset$ do

(E6)  Pick a node $u \in V_r^A \setminus U$.

(E7)  Find the subset $F(u; G_r^A)$ of $V_r^A$.

(E8)  Find the subset $C(u; G_r^A) = B(u; G_r^A) \cap F(u; G_r^A)$ of $F(u; G_r^A)$.

(E9)  Set $|F(A \cup \{v\}; G_r)| \leftarrow |F(u; G_r^A)| + |F(A; G_r)|$ for all $v \in C(u; G_r^A)$.

(E10)  Set $U \leftarrow U \cup C(u; G_r^A)$.

(E11)  end while

Now, we explain this algorithm. In Step (E1), we find the subset $F(A; G_r)$
that is reachable from $A$ on graph $G_r$. In Step (E2), we use the fact that if
$v \in F(A; G_r)$, the set $F(A \cup \{v\}; G_r)$ that is reachable from $A \cup \{v\}$ on $G_r$ is
equal to the set $F(A; G_r)$, and we simultaneously compute $|F(A \cup \{v\}; G_r)|$
for all $v \in F(A; G_r)$. In Step (E3), we find the subset $V_r^A = V \setminus F(A; G_r)$,
and also find the induced graph $G_r^A$ of graph $G_r$ to $V_r^A$. In Steps (E4) to
(E11), we use the fact that if $v \notin F(A; G_r)$, $|F(A \cup \{v\}; G_r)|$ is obtained
as the sum of $|F(A; G_r)|$ and $|F(v; G_r^A)|$. This fact enables us to reduce
the graph in question from $G_r$ to $G_r^A$. We attempt to decompose graph $G_r^A$
into its SCCs. In Step (E6), on graph $G_r^A$, we pick a node $u$ that does not
belong to the SCCs that we have already found. In Step (E7), we find the
set $F(u; G_r^A)$ that is reachable from $u$ on graph $G_r^A$. In Step (E8), we find
the subset $C(u; G_r^A) = B(u; G_r^A) \cap F(u; G_r^A)$ of $F(u; G_r^A)$ by tracing
backward all the links from $u$ on the induced graph of $G_r^A$ to $F(u; G_r^A)$. Note
that the set $C(u; G_r^A)$ is equal to the SCC $\mathrm{SCC}(u; G_r^A)$ that contains $u$. In
Step (E9), we use the fact that $|F(v; G_r^A)| = |F(u; G_r^A)|$ if $v \in C(u; G_r^A)$, and
simultaneously compute $|F(A \cup \{v\}; G_r)|$ for all $v \in C(u; G_r^A)$. We illustrate
the flow of the algorithm in the following example:

Example: We consider the graph $G_r$ shown in Figure 1a, where $V = \{v_1,
v_2, v_3, v_4, v_5, v_6, v_7\}$. We set $A = \{v_1\}$. In this case, the process of the
algorithm proceeds as follows.

In Step (E1), we find $F(A; G_r) = \{v_1, v_2, v_3\}$. In Step (E2), we find
$|F(A \cup \{v_2\}; G_r)| = |F(A \cup \{v_3\}; G_r)| = 3$. In Step (E3), we find $V_r^A =
\{v_4, v_5, v_6, v_7\}$ and $G_r^A$ as shown in Figure 1b. In Step (E4), we set $U =
\emptyset$. In Step (E5), we check $V_r^A \setminus U = \{v_4, v_5, v_6, v_7\} \ne \emptyset$. In Step (E6),
we pick $v_4 \in V_r^A \setminus U$. In Step (E7), we find $F(v_4; G_r^A) = \{v_4, v_5, v_6, v_7\}$.
In Step (E8), we find $C(v_4; G_r^A) = B(v_4; G_r^A) \cap F(v_4; G_r^A) = \{v_4, v_5, v_6\}$ in
$F(v_4; G_r^A)$. In Step (E9), we find $|F(A \cup \{v_4\}; G_r)| = |F(A \cup \{v_5\}; G_r)| =
|F(A \cup \{v_6\}; G_r)| = 7$. In Step (E10), we set $U = \{v_4, v_5, v_6\}$. In Step (E11),
we return to Step (E5). In Step (E5), we check $V_r^A \setminus U = \{v_7\} \ne \emptyset$. In
Step (E6), we pick $v_7 \in V_r^A \setminus U$. In Step (E7), we find $F(v_7; G_r^A) = \{v_7\}$. In
Step (E8), we find $C(v_7; G_r^A) = \{v_7\}$. In Step (E9), we find $|F(A \cup \{v_7\}; G_r)|
= 4$. In Step (E10), we set $U = \{v_4, v_5, v_6, v_7\}$. In Step (E11), we return to
Step (E5). In Step (E5), we check $V_r^A \setminus U = \emptyset$. Then, the process of the
algorithm ends.

[Figure 1, panel (a): An example of graph $G_r$.]

Figure 1: An illustration of the flow of the proposed algorithm for evaluating
$\{|F(A \cup \{v\}; G_r)|;\ v \in V \setminus A\}$, where $r \in R_G$ and $A = \{v_1\}$.

4.3  Computational  Complexity  of  Proposed  Method 

In  the  same  way  as  in  Section  3.3,  we  evaluate  the  computational  complexity 
of  the  proposed  method  as  the  expected  number  of  examined  nodes  for 
estimating  all  the  marginal  influence  degrees  {&(A  U  {u});  v  £  V  \  A}  of  A 
in  Step  (G3)  of  the  greedy  algorithm. 

Let $G_r$ be a graph generated from the occupation probability distribution $q(r)$ of the corresponding bond percolation model. We consider evaluating the expected number $Z(A, G_r)$ of examined nodes for computing $\{|F(A \cup \{v\}; G_r)|;\ v \in V \setminus A\}$ by the proposed method (see Section 4.2). First, the number of examined nodes for finding $F(A; G_r)$ is given by $|F(A; G_r)|$. Let

$$V_r^A = \bigcup_{u \in U_r^A} \mathrm{SCC}(u; G_r^A)$$

be the SCC decomposition of the induced graph $G_r^A$ of $G_r$ to $V_r^A = V \setminus F(A; G_r)$, where $U_r^A$ stands for the set of all the representative nodes for SCCs. For any $u \in U_r^A$, the number of examined nodes for finding $F(u; G_r^A)$ is $|F(u; G_r^A)|$. Suppose now that $F(u; G_r^A)$ is found. Then, the number of examined nodes for finding $C(u; G_r^A)$ ($= \mathrm{SCC}(u; G_r^A)$) is $|\mathrm{SCC}(u; G_r^A)|$, since $C(u; G_r^A) = B(u; G_r^A) \cap F(u; G_r^A)$ is calculated on the induced graph of $G_r^A$ to $F(u; G_r^A)$. Therefore, the number $Z(A, G_r)$ of examined nodes for computing $\{|F(A \cup \{v\}; G_r)|;\ v \in V \setminus A\}$ by the proposed method is as follows:

$$Z(A, G_r) = |F(A; G_r)| + \sum_{u \in U_r^A} \left( |F(u; G_r^A)| + |\mathrm{SCC}(u; G_r^A)| \right).$$

By the definition of graph $G_r^A$, we have

$$\sum_{u \in U_r^A} |\mathrm{SCC}(u; G_r^A)| = N - |F(A; G_r)|,$$

where $N = |V|$. Thus, we have

$$Z(A, G_r) = N + \sum_{u \in U_r^A} |F(u; G_r^A)|. \tag{4}$$

Since $|F(u; G_r^A)| = |F(A \cup \{u\}; G_r)| - |F(A; G_r)|$, we can estimate the expected value of $|F(u; G_r^A)|$ as $\sigma(A \cup \{u\}) - \sigma(A)$. Hence, by Equation (4), we can estimate the expected number $Z(A, G_r)$ of examined nodes for computing $\{|F(A \cup \{v\}; G_r)|;\ v \in V \setminus A\}$ as

$$Z(A, G_r) = N + \left\langle \sum_{u \in U_r^A} \left( \sigma(A \cup \{u\}) - \sigma(A) \right) \right\rangle_r,$$

where $\langle f(r) \rangle_r$ stands for the operation that averages $f(r)$ with respect to $r$ under $q(r)$, that is,

$$\langle f(r) \rangle_r = \sum_{r \in R(G)} f(r)\, q(r).$$

From the above results, we can estimate that the expected number $C_1$ of examined nodes for the proposed method is

$$C_1 = M \left\{ N + \left\langle \sum_{u \in U_r^A} \left( \sigma(A \cup \{u\}) - \sigma(A) \right) \right\rangle_r \right\}. \tag{5}$$
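The node counts in this section correspond to a concrete traversal on each sampled graph: one forward search from $A$, then, for each representative node of the remaining graph, one forward search and one backward search restricted to the forward set. The following is a minimal Python sketch of that per-graph computation, not the authors' C implementation. The function names `reach` and `influence_sizes` and the edge-list input format are assumptions made for illustration.

```python
from collections import defaultdict


def reach(adj, sources, allowed):
    """Nodes of `allowed` reachable from `sources` by following links in `adj`."""
    seen = {s for s in sources if s in allowed}
    stack = list(seen)
    while stack:
        u = stack.pop()
        for w in adj.get(u, ()):
            if w in allowed and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def influence_sizes(nodes, links, A):
    """Return {v: |F(A ∪ {v}; G_r)|} for every v not in A on one sampled graph."""
    fwd, bwd = defaultdict(list), defaultdict(list)
    for u, v in links:
        fwd[u].append(v)
        bwd[v].append(u)
    all_nodes = set(nodes)
    f_a = reach(fwd, A, all_nodes)        # F(A; G_r)
    rest = all_nodes - f_a                # node set of the induced graph
    # Nodes already reached from A add nothing new.
    sizes = {v: len(f_a) for v in f_a if v not in A}
    todo = set(rest)
    while todo:
        u = next(iter(todo))              # representative node of a new SCC
        f_u = reach(fwd, [u], rest)       # F(u) on the induced graph
        scc = reach(bwd, [u], f_u)        # B(u) ∩ F(u), i.e. the SCC of u
        for v in scc:                     # all members of an SCC share F(u)
            sizes[v] = len(f_a) + len(f_u)
        todo -= scc
    return sizes
```

Every node of an SCC receives its value from a single forward search, which is where the saving over one search per node comes from.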


4.4  Computational  Complexity  Comparison 

We compare the proposed method with the conventional method in terms of computational complexity. Both methods need $M$ to be specified as a parameter, and we use the same value for both. We note that more coin-flips are used in the conventional method. In fact, if we think of a single run, i.e., any one of the $M$ runs, the expected number of coin-flips for the conventional method is $O(|V|\,\sigma(v))$ for both the IC and LT models, whereas that for the proposed method is $O(|E|)$ for the IC model and $O(|V|)$ for the LT model. Note that in the case of the LT model for the proposed method, the coin-flip is realized by a roulette for each node, i.e., picking at most one incoming link. However, if we focus on a single node $v$ for initial activation from which to propagate the information, the number of coin-flips is $O(\sigma(v))$ for both the conventional and the proposed methods and for both the IC and the LT models, because only the activated nodes (whose expected number is $\sigma(v)$) are on the paths that lead to reachable nodes from $v$ in the proposed method. Thus, by using the same value of $M$, both would estimate $\sigma(v)$ with the same accuracy in principle (see Appendix A). The biggest difference is that in the conventional method, when $A$ is not empty, many of the coin-flips are redundant; that is, the diffusion process from $A$ is repeatedly performed, whereas in the proposed method, no such repetition is made. This contributes to the stability of the proposed method. Below we begin by explaining the reason why we investigate the examined nodes to compare the proposed and the conventional methods.

First, we consider the case of the IC model. Both the proposed and the conventional methods flip a coin with a bias $p_{u,v}$ on a link $(u, v)$ to decide whether to propagate the information through the link $(u, v)$ or not. Here, if we assume that all the coins are flipped in advance for the conventional method and ignore the computational complexity for flipping a coin and deciding whether or not to propagate the information, then for both the proposed and the conventional methods, the major computation is to trace forward or backward the links along which the information propagates and identify the nodes to visit. Therefore, we evaluate the computational complexities of both methods for the IC model in terms of the expected number of examined nodes.
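Flipping all the coins in advance amounts to drawing one graph $G_r$ from the bond percolation model. A minimal Python sketch of that step for the IC model follows, assuming links are given as a list of pairs and the biases as a dictionary keyed by link. The name `sample_ic_graph` is hypothetical.

```python
import random


def sample_ic_graph(links, p, rng):
    """Keep each directed link (u, v) independently with probability p[(u, v)].

    The result is the list of "occupied" links of one sampled graph.
    """
    return [(u, v) for (u, v) in links if rng.random() < p[(u, v)]]


# Usage: a uniform propagation probability of 10% on three links.
links = [("a", "b"), ("b", "c"), ("c", "a")]
p = {link: 0.10 for link in links}
occupied = sample_ic_graph(links, p, random.Random(0))
```

One coin is flipped per link, which matches the $O(|E|)$ count per run stated for the IC model.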

Next, we consider the case of the LT model. For the proposed method, we ignore the computational complexity for the process of choosing at most one incoming link of each node in the original graph. For the conventional method, we ignore the computational complexity for the process of choosing the threshold $\theta_v$ of each node $v$ in the original graph. Note that the proposed method performs the process $M$ times, whereas the conventional method performs the process $MN$ times. Moreover, for the conventional method, we further ignore the computational complexity for adding the weights from the neighboring active nodes to a node and deciding whether the node becomes active or not. Then, the major computation for the conventional method is to trace forward the links along which the information propagates and identify the nodes to visit. Therefore, we also evaluate the computational complexities of both methods for the LT model in terms of the expected number of examined nodes.
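The roulette used by the proposed method for the LT model can be sketched as follows: each node keeps at most one incoming link, chosen with probability equal to its weight, and keeps none with the remaining probability. This is an illustrative Python sketch. The name `sample_lt_graph` and the dictionary input format are assumptions.

```python
import random


def sample_lt_graph(in_weights, rng):
    """Pick at most one incoming link per node.

    in_weights maps each node v to a list of (u, w_uv) pairs whose weights
    sum to at most 1. Link (u, v) is kept with probability w_uv.
    """
    kept = []
    for v, parents in in_weights.items():
        r = rng.random()          # one roulette spin per node
        acc = 0.0
        for u, w in parents:
            acc += w
            if r < acc:
                kept.append((u, v))
                break             # at most one incoming link survives
    return kept
```

One spin per node gives the $O(|V|)$ count per run stated for the LT model.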

Now, we compare the proposed and the conventional methods in terms of the expected number of examined nodes. We use the results in Sections 3.3 and 4.3. By Equation (2), the expected number $C_0$ of examined nodes for the conventional method can be estimated as

$$C_0 = M \left\{ N - |A| + \sum_{v \in V \setminus A} \left( \sigma(A \cup \{v\}) - 1 \right) \right\}, \tag{6}$$

since $\sum_{v \in V \setminus A} 1 = N - |A|$. In Equation (6), we can expect that $|A| \ll N$ ($= |V|$), and $\sigma(A \cup \{v\}) - 1$ is summed up for almost all nodes $v \in V$, since $k \ll N$.

On the other hand, we can generally expect $|U_r^A| < N$ in Equation (5). Also, we have $\sigma(A) \geq 1$ in the greedy algorithm if $A \neq \emptyset$. Moreover, for any $v \in V \setminus A$, $\sigma(A \cup \{v\}) - \sigma(A)$ decreases as $|A|$ increases, since $\sigma$ is a submodular function. Hence, we can generally expect that in Step (G3) of the greedy algorithm, the proposed method has a much smaller expected number of examined nodes than the conventional method.

From the above results, we can expect that, compared with the conventional method, the proposed method will achieve a large reduction in computational cost.

5  Experimental  Evaluation 

Using large-scale real networks, we experimentally evaluated the performance of the proposed method.

5.1  Network  Datasets 

In the evaluation experiments, it is desirable to use large-scale networks that exhibit many of the key features of real social networks. Here, we show the experimental results for two different datasets of such real networks.


Figure 2: The out-degree distribution for the blog dataset.

Figure 3: The in-degree distribution for the blog dataset.

First, we employed a trackback network of blogs, since a piece of information can propagate from one blog author to another through a trackback, where a trackback is a kind of hyperlink with a linkback (i.e., link notification) function. We exploited the blog “Theme salon of blogs” in the site “goo” (http://blog.goo.ne.jp/usertheme/), where blog authors could recruit trackbacks of other blog authors by registering interesting themes. We collected a large-scale connected trackback network in May 2005 by the following breadth-first search process:

1.  We  started  the  process  from  the  blog  of  the  theme  “JR  Fukuchiyama 
Line  Derailment  Collision”  in  the  site  “goo”,  analyzed  its  HTML  file, 
and  extracted  the  list  of  the  URLs  of  the  source  blogs  of  the  trackbacks 
to  this  blog. 

2.  For  each  list  obtained,  we  collected  the  blogs  of  the  URLs  in  the  list. 

3.  For  each  blog  collected,  we  analyzed  its  HTML  file,  and  constructed 
the  list  of  the  URLs  of  the  source  blogs  of  the  trackbacks  to  the  blog. 

4.  We  repeated  from  Step  2  until  depth  ten  from  the  original  blog. 
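The four steps above amount to a breadth-first search bounded at depth ten. The following is a minimal Python sketch under stated assumptions: the callable `get_trackback_sources` is hypothetical and stands in for fetching a blog's HTML file and extracting the URLs of its trackback sources, and recording each link as (source blog, target blog) is a choice made here for illustration only.

```python
from collections import deque


def crawl_trackbacks(start_url, get_trackback_sources, max_depth=10):
    """Breadth-first collection of blogs linked by trackbacks.

    get_trackback_sources(url) is a caller-supplied (hypothetical) function
    returning the URLs of the source blogs of the trackbacks to `url`.
    """
    depth = {start_url: 0}
    links = set()
    queue = deque([start_url])
    while queue:
        blog = queue.popleft()
        if depth[blog] >= max_depth:
            continue                      # do not expand beyond the depth limit
        for src in get_trackback_sources(blog):
            links.add((src, blog))        # assumed orientation: source -> target
            if src not in depth:
                depth[src] = depth[blog] + 1
                queue.append(src)
    return set(depth), links
```

Passing the fetch-and-parse step in as a function keeps the sketch runnable on a toy dictionary without network access.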

We call this network data the blog dataset. This network was a directed graph of 12,047 nodes and 53,315 links, and is expected to have the features of a real-world social network in light of the way it was generated. To confirm this, the out-degree and in-degree distributions are plotted in Figures 2 and 3, respectively, from which it can be seen that these are “heavy-tailed” distributions of the kind that most large real networks exhibit. Here, the out-degree and in-degree distributions are the distributions of the number of outgoing and incoming links for every node, respectively. Thus, we believe that the blog dataset is a typical example of a large real social network represented by a directed graph, and can be used as the network data to evaluate the performance of the proposed method.

Figure 4: The degree distribution for the Wikipedia dataset.

Next, we employed a network of people that was derived from the “list of people” within Japanese Wikipedia. Specifically, we extracted the maximal connected component of the undirected graph obtained by linking two people in the “list of people” if they co-occur in six or more Wikipedia pages, and constructed a directed graph by regarding those undirected links as bidirectional ones. We call this network data the Wikipedia dataset. The total numbers of nodes and directed links were 9,481 and 245,044, respectively. Compared with the blog network, this network was generated in a rather synthetic way. Figure 4 shows the degree distribution of the undirected graph. We also observe that the degree distribution is a “heavy-tailed” distribution.
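The construction just described (co-occurrence threshold, largest connected component, bidirectional links) can be sketched in a few lines. This Python sketch assumes each page has already been reduced to the set of listed people it mentions. The name `build_cooccurrence_graph` and that input format are assumptions, not details from the original experiments.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence_graph(pages, min_pages=6):
    """pages: iterable of sets of people appearing together on one page."""
    counts = Counter()
    for people in pages:
        for a, b in combinations(sorted(set(people)), 2):
            counts[(a, b)] += 1
    adj = defaultdict(set)
    for (a, b), c in counts.items():
        if c >= min_pages:                # link if co-occurring often enough
            adj[a].add(b)
            adj[b].add(a)
    best, seen = set(), set()
    for s in adj:                         # find the largest connected component
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    # Each undirected link becomes two directed links.
    directed = {(u, w) for u in best for w in adj[u]}
    return best, directed
```

Because every undirected link is counted in both directions, the number of directed links is twice the number of undirected ones.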

For social networks represented as undirected graphs, Newman and Park (2003) observed that they generally have the following two statistical properties that non-social networks do not have. First, they show positive correlations between the degrees of adjacent nodes. Second, they have much higher values of the clustering coefficient than the corresponding configuration models (i.e., random network models). Here, the clustering coefficient $C$ for an undirected graph is defined by

$$C = \frac{3 \times \text{number of triangles on the graph}}{\text{number of connected triples of nodes}},$$

where a “triangle” means a set of three nodes each of which is connected to each of the others, and a “connected triple” means a node connected directly to an unordered pair of other nodes. Note that in terms of sociology, $C$ measures the probability that two of your friends will also be friends of each other. Given a degree distribution $\{\lambda_d\}$, the corresponding configuration model of a random network of $N$ nodes is defined as the ensemble of all possible undirected graphs of $N$ nodes that possess the degree distribution $\{\lambda_d\}$, where $\lambda_d$ is the fraction of nodes in the network having degree $d$. It is known [18] that the value of $C$ for the configuration model is exactly calculated by

$$C = \frac{1}{N z_1} \left( \frac{z_2}{z_1} \right)^2,$$

where

$$z_1 = \sum_d d\, \lambda_d$$

is the average number of neighbors of a node and

$$z_2 = \sum_d d^2 \lambda_d - \sum_d d\, \lambda_d$$

is the average number of second neighbors. For the undirected graph of the Wikipedia dataset, the value of $C$ of the corresponding configuration model was 0.046, while the actual measured value of $C$ was 0.39. Namely, the undirected graph of the Wikipedia dataset had a much higher value of the clustering coefficient than the corresponding configuration model. Moreover, we can see from Figure 5 that the Wikipedia dataset had weakly positive degree correlation. Therefore, we believe that the Wikipedia dataset is also a typical example of a large real social network represented by an undirected graph, and can be used as the network data to evaluate the performance of the proposed method.

Figure 5: The degree correlation for the Wikipedia dataset.
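To make the comparison concrete, the measured clustering coefficient and its configuration-model counterpart can both be computed from an adjacency structure. The Python sketch below is illustrative only. It assumes a simple undirected graph stored as a dictionary of neighbor sets, and it takes the configuration-model value to be $z_2^2 / (N z_1^3)$, which is how the closed form cited from [18] is read here. The function names are hypothetical.

```python
from itertools import combinations


def clustering_coefficient(adj):
    """C = 3 * (number of triangles) / (number of connected triples).

    adj maps each node to the set of its neighbors (simple undirected graph).
    """
    triples = 0
    closed = 0  # triples whose two end nodes are linked; equals 3 * triangles
    for v, nbrs in adj.items():
        d = len(nbrs)
        triples += d * (d - 1) // 2
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0


def config_model_clustering(degrees):
    """Configuration-model value z2^2 / (N * z1^3) for a degree sequence."""
    n = len(degrees)
    z1 = sum(degrees) / n                        # mean number of neighbors
    z2 = sum(d * d for d in degrees) / n - z1    # mean number of second neighbors
    return z2 ** 2 / (n * z1 ** 3)
```

Each triangle is counted once per centre node, so `closed` already carries the factor of three in the numerator.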

5.2  Experimental  Settings 

The proposed and the conventional methods are equipped with parameter $M$. We refer to the conventional method with $M = 1{,}000$ for the IC model as the IC1000. In the same way, we define the LT1000 and LT10000 for the conventional method with the LT model. We also refer to the proposed method using $M = 1{,}000$ and $M = 10{,}000$ for the IC model as the ICBP1000 and ICBP10000, respectively. In the same way, we define the LTBP1000 and LTBP10000 for the proposed method with the LT model. As described in Section 4.4, we compare these methods for the same value of $M$.

The IC and LT models have parameters to be specified in advance. In the IC model, we assigned a uniform probability $p$ to the propagation probability $p_{u,v}$ for any directed link $(u, v)$ of the network, that is, $p_{u,v} = p$. In the LT model, we uniformly set weights as follows: for any node $v$ of the network, the weight $w_{u,v}$ from a parent node $u \in \Gamma(v)$ is given by $w_{u,v} = 1/|\Gamma(v)|$.

We implemented all our programs of both the conventional and proposed methods for the IC and LT models in C. The basic structure of these programs is the same, except that the routines of active node calculation used in the conventional method are replaced with those of bond percolation and SCC decomposition used in the proposed method.

5.3  Experimental  Results 

We compared the proposed method with the conventional method in terms of both the performance of the approximate solution $A_k$ and the processing time for solving the influence maximization problem of size $k$. The performance of $A_k$ is measured by the influence degree $\sigma(A_k)$. We estimated $\sigma(A_k)$ by using 300,000 simulations, following the work of Kempe et al. (2003). All our experiments were undertaken on a single Dell PC with an Intel 3.4GHz Xeon processor and 2GB of memory, running Linux.
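Estimating $\sigma(A_k)$ by simulation means running the diffusion process many times from $A_k$ and averaging the number of finally active nodes. The following is a minimal Python sketch for the IC model with a uniform propagation probability, intended only as an illustration. The names `simulate_ic` and `estimate_sigma` and the dictionary-of-lists graph format are assumptions.

```python
import random


def simulate_ic(out_links, seeds, p, rng):
    """Run one IC diffusion from `seeds`; return the number of active nodes."""
    active = set(seeds)
    frontier = list(active)
    while frontier:
        newly_active = []
        for u in frontier:                # each new node gets a single chance
            for v in out_links.get(u, ()):
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return len(active)


def estimate_sigma(out_links, seeds, p, n_sims, rng):
    """Average the final number of active nodes over n_sims independent runs."""
    total = sum(simulate_ic(out_links, seeds, p, rng) for _ in range(n_sims))
    return total / n_sims
```

With `n_sims` set to 300,000 this corresponds to the evaluation setting described above, although the running time of a pure Python loop would be far from that of a C program.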

In order to keep computational time at a reasonable level for the conventional method, we mainly compared these two methods using $M = 1{,}000$. Note that if a large enough $M$ is taken, these two methods should produce the same solution. We conjecture that $M = 1{,}000$ is not large enough, that is, these two methods with $M = 1{,}000$ cannot necessarily obtain good approximate values for the marginal influence degrees $\{\sigma(A \cup \{v\});\ v \in V \setminus A\}$ of $A$ (see Appendices A and B). Thus, we iterated the same experiment five times independently. Tables 1 and 2 show the experimental results for the IC model with $p = 10\%$ and the LT model for the blog dataset, respectively, where the values are rounded to three significant figures. Note that in these tables and later ones, too, the values are reestimated with 300,000 simulations once $A_k$ has been obtained by each method with a specified $M$. Since the true solution $\sigma(A_k^*)$ is by definition the maximum among all $\sigma(A_k)$, if $\sigma(A_k)$ is estimated accurately, it makes sense to argue that the larger the value is, the closer it is to the true solution and thus the better its quality. We first observe that the results for the proposed method were relatively stable over the iterations, while the results for the conventional method fluctuated somewhat, for large $k$ in particular. Here, we note that the proposed method using $M = 10{,}000$ was stable and always produced the same solution for $k = 30$ over the iterations (not shown in the tables). We also observe that for $k = 30$, the solutions by the ICBP1000 and LTBP1000 outperform those by the IC1000 and LT1000, respectively.

Table 1: Performance of approximate solutions for the influence maximization problem under the IC model with p = 10% for the blog dataset. Upper: IC1000 (the conventional method). Lower: ICBP1000 (the proposed method).

k     $\sigma(A_k)$ (IC1000)
1     1.74 x 10^2   1.74 x 10^2   1.74 x 10^2   1.74 x 10^2   1.74 x 10^2
10    6.93 x 10^2   6.98 x 10^2   6.93 x 10^2   6.91 x 10^2   6.95 x 10^2
20    8.58 x 10^2   8.61 x 10^2   8.57 x 10^2   8.58 x 10^2   8.60 x 10^2
30    9.59 x 10^2   9.69 x 10^2   9.68 x 10^2   9.66 x 10^2   9.78 x 10^2

k     $\sigma(A_k)$ (ICBP1000)
1     1.74 x 10^2   1.74 x 10^2   1.74 x 10^2   1.74 x 10^2   1.74 x 10^2
10    7.02 x 10^2   7.01 x 10^2   7.00 x 10^2   7.01 x 10^2   7.02 x 10^2
20    8.74 x 10^2   8.75 x 10^2   8.73 x 10^2   8.74 x 10^2   8.73 x 10^2
30    9.91 x 10^2   9.92 x 10^2   9.90 x 10^2   9.92 x 10^2   9.92 x 10^2

Table 3 shows the processing time to obtain $A_k$ by the IC1000, ICBP1000, LT1000 and LTBP1000 for the blog dataset, where the values are rounded to three significant figures. We observe from Table 3 that the ICBP1000 and LTBP1000 are much more efficient than the IC1000 and LT1000, respectively. For example, to obtain the approximate solution $A_{30}$ for $k = 30$, both the IC1000 and LT1000 needed about 2.5 days, while the ICBP1000 and LTBP1000 needed about 2.5 and 1.5 minutes, respectively. Namely, for $k = 30$, the ICBP1000 was $1.8 \times 10^3$ times faster than the IC1000, and the LTBP1000 was $4.6 \times 10^3$ times faster than the LT1000. We also examined the LT10000 and LTBP10000 on the blog dataset. In order to obtain the approximate solution $A_{30}$, the LT10000 needed about 27 days, while the LTBP10000 needed only about 14 minutes.

Table 2: Performance of approximate solutions for the influence maximization problem under the LT model for the blog dataset. Upper: LT1000 (the conventional method). Lower: LTBP1000 (the proposed method).

k     $\sigma(A_k)$ (LT1000)
1     2.86 x 10^2   2.86 x 10^2   2.86 x 10^2   2.86 x 10^2   2.86 x 10^2
10    1.59 x 10^3   1.61 x 10^3   1.61 x 10^3   1.59 x 10^3   1.58 x 10^3
20    2.41 x 10^3   2.40 x 10^3   2.42 x 10^3   2.42 x 10^3   2.38 x 10^3
30    3.02 x 10^3   3.05 x 10^3   3.01 x 10^3   3.01 x 10^3   3.00 x 10^3

k     $\sigma(A_k)$ (LTBP1000)
1     2.86 x 10^2   2.86 x 10^2   2.86 x 10^2   2.86 x 10^2   2.86 x 10^2
10    1.60 x 10^3   1.61 x 10^3   1.61 x 10^3   1.59 x 10^3   1.60 x 10^3
20    2.44 x 10^3   2.44 x 10^3   2.44 x 10^3   2.44 x 10^3   2.44 x 10^3
30    3.07 x 10^3   3.07 x 10^3   3.06 x 10^3   3.06 x 10^3   3.06 x 10^3

Table 3: Processing time (sec.) for the blog dataset.

k     IC1000        ICBP1000      LT1000        LTBP1000
1     3.70 x 10^2   7.07          6.57 x 10^2   3.19
10    4.69 x 10^4   5.68 x 10^1   4.24 x 10^4   2.96 x 10^1
20    1.24 x 10^5   1.09 x 10^2   1.25 x 10^5   5.64 x 10^1
30    2.13 x 10^5   1.60 x 10^2   2.32 x 10^5   8.20 x 10^1

Tables 4, 5 and 6 show the experimental results for the Wikipedia dataset. We see that the results were qualitatively very similar to the ones for the blog dataset. First, the solutions by the ICBP1000 and LTBP1000 outperformed those by the IC1000 and LT1000, respectively. We also note that the proposed method using $M = 10{,}000$ was stable and always produced the same solution for $k = 30$ over the iterations (not shown in the tables). Next, the ICBP1000 and LTBP1000 were much more efficient than the IC1000 and LT1000, respectively. For example, for obtaining the approximate solution $A_{30}$ for $k = 30$, the ICBP1000 was $1.9 \times 10^3$ times faster than the IC1000, and the LTBP1000 was $8.3 \times 10^3$ times faster than the LT1000. We also conducted experiments on some other large-scale real networks, including a blogroll network of blogs, and confirmed the effectiveness of the proposed method.

Table 4: Performance of approximate solutions for the influence maximization problem under the IC model with p = 1% for the Wikipedia dataset. Upper: IC1000 (the conventional method). Lower: ICBP1000 (the proposed method).

k     $\sigma(A_k)$ (IC1000)
1     1.39 x 10^2   1.39 x 10^2   1.36 x 10^2   1.36 x 10^2   1.36 x 10^2
10    3.91 x 10^2   3.97 x 10^2   3.98 x 10^2   4.02 x 10^2   4.01 x 10^2
20    4.56 x 10^2   4.64 x 10^2   4.62 x 10^2   4.64 x 10^2   4.66 x 10^2
30    4.97 x 10^2   5.02 x 10^2   4.95 x 10^2   5.00 x 10^2   4.98 x 10^2

k     $\sigma(A_k)$ (ICBP1000)
1     1.39 x 10^2   1.39 x 10^2   1.39 x 10^2   1.36 x 10^2   1.36 x 10^2
10    4.05 x 10^2   4.06 x 10^2   4.07 x 10^2   4.06 x 10^2   4.07 x 10^2
20    4.75 x 10^2   4.76 x 10^2   4.76 x 10^2   4.75 x 10^2   4.77 x 10^2
30    5.16 x 10^2   5.17 x 10^2   5.17 x 10^2   5.16 x 10^2   5.17 x 10^2

5.4  Discussion 

These  experimental  results  show  that  the  proposed  method  is  much  more 
efficient  than  the  conventional  method. 

First, we investigate the reason why the proposed method outperforms the conventional method in the case of $M = 1{,}000$ for our network datasets. If we take a sufficiently large $M$ (e.g., $M = 100{,}000$), the proposed and the conventional methods should produce the same solution. As shown in the experiments, the estimation accuracy of the influence degree function $\sigma$ with $M = 1{,}000$ is not so high for both methods. Now, consider estimating all the marginal influence degrees $\{\sigma(A_k \cup \{v\});\ v \in V \setminus A_k\}$ of solution $A_k$, and choosing the node $v_{k+1}$ that maximizes $\sigma(A_k \cup \{v\})$ ($v \in V \setminus A_k$). It should be reemphasized that the influence set of $A_k$ is equally evaluated for all $v \in V \setminus A_k$ for the proposed method. In fact, when $\sigma(A_k \cup \{v\})$ is estimated using Equation (3), each $|F(A_k \cup \{v\}; G_{r_m})|$ is basically computed by

$$|F(A_k \cup \{v\}; G_{r_m})| = |F(v; G_{r_m}^{A_k})| + |F(A_k; G_{r_m})|.$$

Thus, for the proposed method, a node that is relatively optimal for $A_k$ can be selected as $v_{k+1}$. On the other hand, for the conventional method, the influence set of $A_k$ is not equally evaluated for all $v \in V \setminus A_k$, since $\sigma(A_k \cup \{v\})$ is independently estimated for every $v$, each by a distinct simulation. We also note that the number of final active nodes for a given target set varied greatly from simulation to simulation in the IC and LT models (see Appendix B). Thus, unlike the proposed method, the selection of $v_{k+1}$ in the conventional method using $M = 1{,}000$ by necessity completely depends on how the influence set of $A_k$ is evaluated by chance for each $v \in V \setminus A_k$. Therefore, we believe that the proposed method outperforms the conventional method in the case of $M = 1{,}000$ for our network datasets.

Table 5: Performance of approximate solutions for the influence maximization problem under the LT model for the Wikipedia dataset. Upper: LT1000 (the conventional method). Lower: LTBP1000 (the proposed method).

k     $\sigma(A_k)$ (LT1000)
1     3.41 x 10^2   3.41 x 10^2   3.41 x 10^2   3.41 x 10^2   3.41 x 10^2
10    1.72 x 10^3   1.72 x 10^3   1.67 x 10^3   1.66 x 10^3   1.72 x 10^3
20    2.55 x 10^3   2.55 x 10^3   2.45 x 10^3   2.53 x 10^3   2.55 x 10^3
30    3.12 x 10^3   3.03 x 10^3   2.99 x 10^3   3.01 x 10^3   3.11 x 10^3

k     $\sigma(A_k)$ (LTBP1000)
1     3.41 x 10^2   3.41 x 10^2   3.41 x 10^2   3.41 x 10^2   3.41 x 10^2
10    1.72 x 10^3   1.72 x 10^3   1.72 x 10^3   1.72 x 10^3   1.71 x 10^3
20    2.58 x 10^3   2.58 x 10^3   2.59 x 10^3   2.59 x 10^3   2.59 x 10^3
30    3.18 x 10^3   3.18 x 10^3   3.18 x 10^3   3.18 x 10^3   3.18 x 10^3

Table 6: Processing time (sec.) for the Wikipedia dataset.

k     IC1000        ICBP1000      LT1000        LTBP1000
1     6.63 x 10^2   1.91 x 10^1   5.41 x 10^2   5.17
10    1.94 x 10^5   1.74 x 10^2   9.60 x 10^4   4.64 x 10^1
20    4.82 x 10^5   3.42 x 10^2   3.03 x 10^5   8.57 x 10^1
30    8.03 x 10^5   5.10 x 10^2   5.69 x 10^5   1.21 x 10^2

Here, to explain the reason described above more clearly, we consider the following method as an extended version of the conventional method:

1. for $m = 1$ to $M$ do

2. Find the set $D(A_k)$ of active nodes at the end of the random process of the IC or the LT model for the initial active set $A_k$ by simulation.

3. for each $v \in V \setminus A_k$ do

4. Find the set $D(v)$ of active nodes at the end of the random process of the IC or the LT model for the initial active set $\{v\}$ by simulation.

5. Set $x_{v,m} \leftarrow |D(A_k) \cup D(v)|$.

6. end for

7. end for

8. for each $v \in V \setminus A_k$ do

9. Set $\sigma(A_k \cup \{v\}) \leftarrow (1/M) \sum_{m=1}^{M} x_{v,m}$.

10. end for

The extended method should improve the conventional method because the influence set of $A_k$ is now equally evaluated for all $v \in V \setminus A_k$, and should be comparable to the proposed method in quality of solution. However, it cannot be as efficient as the proposed method since it does not incorporate the SCC-finding technique.
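The ten steps of the extended method can be sketched in Python as follows, here for the IC model with a uniform propagation probability. This is an illustration under stated assumptions rather than code from the experiments: the names `ic_active_set` and `extended_estimates` and the dictionary-of-lists graph format are hypothetical.

```python
import random


def ic_active_set(out_links, seeds, p, rng):
    """Return the set of active nodes at the end of one IC diffusion."""
    active = set(seeds)
    frontier = list(active)
    while frontier:
        newly_active = []
        for u in frontier:
            for v in out_links.get(u, ()):
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return active


def extended_estimates(nodes, out_links, a_k, p, m_runs, rng):
    """Estimate sigma(A_k ∪ {v}) for every v outside A_k (steps 1 to 10)."""
    totals = {v: 0 for v in nodes if v not in a_k}
    for _ in range(m_runs):                              # step 1
        d_a = ic_active_set(out_links, a_k, p, rng)      # step 2: shared by all v
        for v in totals:                                 # step 3
            d_v = ic_active_set(out_links, [v], p, rng)  # step 4
            totals[v] += len(d_a | d_v)                  # step 5
    return {v: t / m_runs for v, t in totals.items()}    # steps 8 to 10
```

The point of the design is that `d_a` is drawn once per run and reused for every candidate node, so the contribution of $A_k$ is the same across all candidates within a run.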


Figure 6: Processing time difference $\tau(k)$ between the proposed and conventional methods for the blog dataset in the case of the IC model.

Next, we discuss the sources of the difference between the proposed and conventional methods in processing time. Note that we use the same value of parameter $M$ for both methods. Let $\tau_1(k)$ and $\tau_0(k)$ respectively denote the processing times of the proposed and the conventional methods for obtaining solution $A_{k+1}$ when solution $A_k$ is given. We define the processing time difference $\tau(k)$ by $\tau_0(k) - \tau_1(k)$ for $k$, the number of nodes selected. We believe the essential source of speed-up in the proposed method is that we compute $\{|F(A_k \cup \{v\}; G_r)|;\ v \in V \setminus A_k\}$ on graph $G_r$ as follows:

• By first identifying $F(A_k; G_r)$, we reduce the graph in question from $G_r$ to the induced graph $G_r^{A_k}$ of $G_r$ to $V \setminus F(A_k; G_r)$.

• By decomposing $G_r^{A_k}$ into the SCCs, we compute $|F(A_k \cup \{v\}; G_r)|$ for many nodes $v$ at once.

Namely, we believe that the larger the size of $F(A_k; G_r)$ is, the larger the value of $\tau(k)$ is. Moreover, we believe that the larger the sizes of the SCCs of graph $G_r^{A_k}$ are, the larger the value of $\tau(k)$ is. Here, we demonstrate these characteristics for the IC model. Note that the size of $F(A_k; G_r)$ monotonically increases with the value of $k$. Thus, we can expect that the value of $\tau(k)$ also monotonically increases with the value of $k$. Note also that graph $G_r$ becomes denser when the value of the propagation probability $p$ is larger, and the sizes of the SCCs of $G_r$ also become larger. Thus, we can also expect that the value of $\tau(k)$ monotonically increases with the value of $p$. Figure 6 shows $\tau(k)$ for $p = 0.1\%$, $1\%$ and $10\%$ as a function of $k$ for the blog dataset, where circles, squares and diamonds indicate $\tau(k)$ for $p = 0.1\%$, $1\%$ and $10\%$, respectively. Here, we used $M = 1{,}000$ for both the proposed and the conventional methods. The results support our conjectures.

6  Related  Work 

6.1  Calculation  of  Influence  Degrees 

First, we describe work related to the calculation of influence degrees in the IC model. Let us recall that the SIR model for the spread of a disease on a network is equivalent to a bond percolation model on the same network, and the size of a disease outbreak from a node corresponds to the size of the cluster that can be reached from the node by traversing only the “occupied” links. A series of studies uses this correspondence to develop a method for theoretically calculating the probability distribution of the size of a disease outbreak that starts with a randomly chosen node in the configuration model (i.e., a random network model) with a given degree distribution (Callaway et al., 2000; Newman, 2002; Newman, 2003), and to derive a condition for the disease outbreak from a randomly chosen node to give an epidemic outbreak that affects a non-zero fraction of the network in the limit of very large network size. Mathematically more rigorous treatments of similar results can be found in the work of Molloy and Reed (1998) and Chung and Lu (2002).

Next, we describe work related to the calculation of influence degrees in the LT model. Watts (2002) investigated the LT model on a network to explain large but rare cascade phenomena triggered by small initial shocks. Using the concept of site percolation, he theoretically derived a condition for the cascade from a randomly chosen seed node to give a global cascade that affects a non-zero fraction of the network in the limit of infinitely large network size for the configuration model (i.e., a random network model) with a given degree distribution.

The above-mentioned studies focused on global properties averaged over
a random network in the limit of very large size, whereas our primary interest
is to answer, in practice, which nodes are most influential for information
diffusion on a given real-world network of finite size. We also note that
those studies dealt with undirected graphs, whereas our work investigates
information diffusion on networks represented by directed graphs. Moreover,
the  theories  developed  in  those  studies  assumed  that  the  loop  structure  on 
a  network  of  interest  can  be  essentially  ignored  in  the  limit  of  large  network 
size.  However,  this  property  is  not  true  of  many  large-scale  social  networks, 
and  it  is  an  open  question  whether  or  not  those  theories  are  effective  for  such 
networks  (Newman,  2003).  In  fact,  the  clustering  coefficient  C  quantifies  the 
loop  structure  in  a  network,  and  it  was  indeed  observed  that  many  social 
networks  have  much  higher  values  of  C  than  the  corresponding  configuration 
models  (i.e.,  random  network  models)  (Newman  and  Park,  2003). 
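
To make the role of the clustering coefficient concrete, the following Python sketch computes C for an undirected simple graph. It assumes one common definition, C = 3 x (number of triangles) / (number of connected triples); the paper does not restate the definition in this section, so that choice and the function name are assumptions.

```python
from itertools import combinations


def clustering_coefficient(adj):
    """C = 3 * (#triangles) / (#connected triples).

    adj: dict mapping each node to the set of its neighbors
         (undirected, no self-loops).
    """
    closed = 0   # each triangle is counted once per centre node, i.e. 3 times
    triples = 0  # paths of length two, counted by their centre node
    for v, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0
```

A tree-like network gives C = 0, whereas a network with many short loops gives a value close to 1, which is the regime where the loop-free approximation is in doubt.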

6.2  Solving  the  Influence  Maximization  Problem 

The influence degree function $\sigma$ is submodular (see Kempe et al., 2003). For
solving a combinatorial optimization problem of a submodular function $f$ on
$V$ by the greedy algorithm, Leskovec et al. (2007) have recently presented
a lazy evaluation method that leads to far fewer (expensive) evaluations
of the marginal increments $f(A \cup \{v\}) - f(A)$ ($v \in V \setminus A$) in the greedy
algorithm for $A \neq \emptyset$, and achieved an improvement in speed. Note here
that their method still requires at least evaluating $f(v)$ for all $v \in V$, that is,
at least the work of the conventional method for $k = 1$. We
can apply their method to the influence maximization problem for the IC or
LT models, where the influence degree function $\sigma$ is evaluated through
simulations of the corresponding random process. It is clear that this method
is more efficient than the conventional method. However, the proposed
method for $k = 30$ was faster than the conventional method for $k = 1$, as
shown in Tables 3 and 6. Therefore, it is evident that the proposed method
can be faster than the method by Leskovec et al. (2007) for the influence
maximization problem for the IC or LT models. To quantify the difference,
we implemented the lazy evaluation method. The processing time for $k = 30$
on the blog dataset was $2.12 \times 10^3$ and $8.28 \times 10^2$ seconds for the IC
and the LT models, respectively, and the corresponding processing time on
the Wikipedia dataset was $1.46 \times 10^4$ and $2.65 \times 10^3$ seconds for the IC
and the LT models, respectively. Here, $M = 1{,}000$ is used as the number of
simulations (see Section 3.2), and the values are rounded to three significant
figures. From these results, we can see that the proposed method was more
than ten times faster than the method by Leskovec et al. (2007) for $k = 30$
on the blog and Wikipedia datasets (see Tables 3 and 6).
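
The lazy evaluation idea discussed above can be sketched as follows. This is a minimal Python illustration of lazy greedy selection for a generic submodular set function, not the implementation used in our experiments; the function name, the priority-queue layout, and the assumption that f is deterministic are ours. When f is estimated by simulation, the estimates are noisy and the submodularity argument holds only approximately.

```python
import heapq


def lazy_greedy(ground_set, f, k):
    """Greedy maximization of a submodular function f with lazy evaluation.

    Returns the selected elements (in order) and the number of calls to f
    that were needed beyond the evaluation of the empty set.
    """
    selected = []
    current = f(frozenset())
    # Initial pass: the gain of every single element must be evaluated once.
    heap = [(-(f(frozenset([v])) - current), v, 0) for v in ground_set]
    heapq.heapify(heap)
    evaluations = len(heap)
    while heap and len(selected) < k:
        neg_gain, v, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # The gain is up to date for the current set, and by
            # submodularity no stale entry can exceed it: select v.
            selected.append(v)
            current += -neg_gain
        else:
            # Stale entry: recompute its gain and push it back.
            gain = f(frozenset(selected) | {v}) - current
            evaluations += 1
            heapq.heappush(heap, (-gain, v, len(selected)))
    return selected, evaluations
```

Even in the best case, the initial pass costs one evaluation per element of the ground set, which is why the cost of the first greedy step is not reduced by lazy evaluation.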

Beyond the IC and LT models, Kempe et al. (2003) proposed the triggering
model as yet another diffusion model on a network. It has been proved
that the triggering model can be identified with a bond percolation model
(see Kempe et al., 2003). The proposed method can be applied to this
model because it can be applied to any diffusion model that can be identified
with a bond percolation model. Future work includes presenting a
large number of realistic examples of such diffusion models.

In this paper, we have considered the progressive case in which nodes
cannot switch from being active to being inactive. However, there are many
information diffusion phenomena for which non-progressive diffusion models are
required. Examples include the spread of posts on a topic in blogspace
(Gruhl et al., 2004). Kempe et al. (2003) proved that the non-progressive case
can be reduced to the progressive case. More specifically, it is proved that
the influence maximization problem for a non-progressive diffusion model on
graph $G$ with time limit $T$ is equivalent to the ordinary influence maximization
problem on the layered graph $G^T$ for the progressive diffusion model, where
$G^T$ is the directed acyclic graph (DAG) constructed by connecting
$(T + 1)$ copies of $G$ forward in time (see Kempe et al., 2003). Therefore, building
effective methods for fundamental progressive models such as the IC and LT
models is also important for the non-progressive case.
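
The layered-graph construction can be sketched in a few lines of Python. The sketch below assumes that each directed link (u, v) of G is replaced by links from the copy of u at time t to the copy of v at time t + 1; the exact construction in Kempe et al. (2003) may differ in detail, and the function name and the (node, time) encoding are hypothetical.

```python
def layered_graph(nodes, edges, T):
    """Build a time-layered DAG from (T + 1) copies of a directed graph.

    nodes: iterable of node identifiers of G
    edges: iterable of directed links (u, v) of G
    T:     time limit; layers are indexed 0, 1, ..., T
    """
    nodes = list(nodes)
    edges = list(edges)
    # One copy (v, t) of every node for each time step t.
    layered_nodes = [(v, t) for t in range(T + 1) for v in nodes]
    # Every link points from layer t to layer t + 1, so no cycle can occur.
    layered_edges = [((u, t), (v, t + 1))
                     for t in range(T) for (u, v) in edges]
    return layered_nodes, layered_edges
```

Because every link goes strictly forward in time, the result is acyclic even when G itself contains cycles.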

From a realistic point of view, the IC and LT models are by no means
complete models, but are at best simplified and partial representations of a
complex reality (see Kempe et al., 2003; Gruhl et al., 2004; Leskovec et al.,
2006). However, in the field of sociology, Watts and Dodds (2007) recently
examined the “influentials hypothesis” in the contexts of the LT model and
the SIR model (i.e., an extended model of the IC model); that is, they
investigated by computer simulations whether or not large cascades of influence
are actually driven by influentials. On the other hand, Even-Dar and
Shapira (2007) mathematically studied the influence maximization problem
in the context of another fundamental model called the voter model. We also
believe that it is important to investigate information diffusion phenomena
for the IC and LT models (i.e., fundamental diffusion models) to deepen our
understanding of these models. Future work includes proposing effective
methods for solving the influence maximization problem in the contexts of
various realistic diffusion models.

6.3  Applications 

As Tables 3 and 6 show, the conventional method is not practical for
problems such as the influence maximization problem addressed in this paper
unless we rely on high-performance computers and sophisticated techniques
such as parallel computing. In contrast,
the proposed method enables us to obtain a practical solution to this kind
of problem on a single standard PC in a reasonable processing time. Thus,
we can apply the proposed method to a variety of real problems.

The work of Watts and Dodds (2007) briefly described above needs a
method to efficiently estimate $\sigma(A)$, and the proposed method is readily
applicable.

As mentioned in the introduction, the influence maximization problem
has many realistic applications. The most straightforward application
would be viral marketing. When we wish to promote a new product (e.g.,
an email service or a search engine) and are given a relevant social network,
we can use the proposed method to find a limited number of key (influential)
persons who should adopt the new product first, and then benefit from the
diffusion effect under
the IC or LT models (i.e., fundamental diffusion models) through the social
network. We admit that the diffusion models we discussed are oversimplified,
but it is still useful to obtain approximate solutions as a first step toward
effective marketing without using classical advertising channels.

The proposed method has an application of a different flavor, namely the
visualization of information flow. Understanding the flow of information
through a complex network is important in sociology and marketing.
We devised a new node embedding method for visualizing the information
diffusion process from the target nodes selected as a solution of
the influence maximization problem (Saito et al., 2008). This visualization
method is characterized by 1) utilization of the target nodes as a set of pivot
objects for visualization, 2) application of a probabilistic algorithm for
embedding all the nodes in the network into a Euclidean space so as to conserve
the posterior information diffusion probability, and 3) varying the appearance
of the embedded nodes on the basis of two label assignment strategies, one
with emphasis on the influence of initially activated nodes, and the other on
the degree of information reachability.

7  Conclusion 

We have considered the influence maximization problem for the IC and LT
models on a large-scale social network represented as a directed graph
$G = (V, E)$. Due to the computational complexity, the greedy search algorithm is
the only practical approach, but the conventional method still needed a large
amount of computation. We have proposed a method of efficiently finding
a good approximate solution to the problem under the greedy algorithm.
In particular, in order to improve the computational efficiency, we have
estimated all the marginal influence degrees $\{\sigma(A \cup \{v\});\ v \in V \setminus A\}$ of a
given target set $A$ in the following way:

• We identify the IC and LT models with the corresponding bond
percolation models.

• For any $v \in V \setminus A$, we estimate the influence degree $\sigma(A \cup \{v\})$ of
$A \cup \{v\}$ as the empirical mean of the number $|F(A \cup \{v\}; G_r)|$ of the
nodes that are reachable from $A \cup \{v\}$ on a graph $G_r$ generated from
the corresponding occupation probability distribution $q(r)$ of the bond
percolation.

In particular, we estimate $\{|F(A \cup \{v\}; G_r)|;\ v \in V \setminus A\}$ as follows:

• We find the set $F(A; G_r)$ that is reachable from $A$ on graph $G_r$, and
simultaneously compute $\{|F(A \cup \{v\}; G_r)|;\ v \in F(A; G_r)\}$.

• We find the induced graph $G_r^A$ of $G_r$ to $V \setminus F(A; G_r)$, and decompose
$G_r^A$ into its SCCs (Strongly Connected Components).

• For each SCC $\mathrm{SCC}(u; G_r^A)$ of $G_r^A$ ($u \in V \setminus F(A; G_r)$), we
simultaneously compute $\{|F(A \cup \{v\}; G_r)|;\ v \in \mathrm{SCC}(u; G_r^A)\}$.
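
The three steps above can be sketched for a single sampled graph as follows. This Python code is an illustrative sketch, not our actual implementation: the adjacency-dictionary representation, the function names, and the use of Kosaraju's algorithm for the SCC decomposition are assumptions. It relies on the fact that all nodes in one SCC reach the same set of nodes, so one search per SCC suffices.

```python
from collections import deque


def reachable_from(adj, sources, blocked=frozenset()):
    """Nodes reachable from `sources` without entering `blocked`."""
    seen = {s for s in sources if s not in blocked}
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in seen and w not in blocked:
                seen.add(w)
                queue.append(w)
    return seen


def sccs(adj):
    """Strongly connected components (Kosaraju); adj has every node as a key."""
    order, visited = [], set()
    for s in adj:
        if s in visited:
            continue
        visited.add(s)
        stack = [(s, iter(adj[s]))]
        while stack:  # iterative depth-first search recording finish order
            u, it = stack[-1]
            for w in it:
                if w not in visited:
                    visited.add(w)
                    stack.append((w, iter(adj[w])))
                    break
            else:
                order.append(u)
                stack.pop()
    radj = {u: [] for u in adj}
    for u, outs in adj.items():
        for w in outs:
            radj[w].append(u)
    comps, assigned = [], set()
    for s in reversed(order):
        if s not in assigned:
            comp = reachable_from(radj, [s], blocked=assigned)
            assigned |= comp
            comps.append(comp)
    return comps


def all_marginal_reach_sizes(nodes, adj, A):
    """Map each v outside A to the number of nodes reachable from A plus v."""
    A = set(A)
    F_A = reachable_from(adj, A)
    # Step 1: adding a node already reachable from A changes nothing.
    sizes = {v: len(F_A) for v in F_A if v not in A}
    # Step 2: graph induced on the nodes not reachable from A.
    rest = set(nodes) - F_A
    sub = {u: [w for w in adj.get(u, ()) if w in rest] for u in rest}
    # Step 3: one search per SCC, shared by all of its members.
    for comp in sccs(sub):
        extra = len(reachable_from(sub, [next(iter(comp))]))
        for v in comp:
            sizes[v] = len(F_A) + extra
    return sizes
```

Averaging the returned sizes over many sampled graphs gives the estimates of the marginal influence degrees; the saving comes from running one search per SCC instead of one per node.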

We have compared the proposed method with the conventional method
in terms of computational complexity and quality of the solution, and have
shown that the proposed method is expected to achieve a large
reduction in computational cost. Moreover, using large-scale networks
including a real blog network, we have experimentally demonstrated the
effectiveness of the proposed method. For example, we obtained the following
results for the influence maximization problem of size $k = 30$ on the blog
and Wikipedia datasets, which are real networks with about 10,000 nodes: in
the case of the IC model, the proposed method was 1800 times faster than
the conventional method, and in the case of the LT model, the proposed
method was 4600 times faster than the conventional method.

Appendix

A  Convergence  Speed 

As described in Section 4.4, by using the same value of $M$, both the proposed
and the conventional methods would estimate $\sigma(v)$ with the same accuracy
in principle. Here, we support this conjecture experimentally.

Following Kempe et al. (2003), we set $M = 300{,}000$ as
a sufficiently large value of $M$; that is, we assume that $\sigma(v)$ for any $v \in V$
is well approximated by 300,000 simulations of the information diffusion
model (i.e., the conventional method using $M = 300{,}000$). For any $v \in V$,
let $\sigma_0(v; M)$ and $\sigma_1(v; M)$ denote the estimates of $\sigma(v)$ by the conventional
and the proposed methods using parameter value $M$, respectively. For the
blog and Wikipedia datasets, we investigated


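\begin{align*}
\mathcal{E} &= \frac{1}{|V|} \sum_{v \in V} \left| \sigma_0(v; 300{,}000) - \sigma_1(v; 300{,}000) \right|, \\
\mathcal{E}_0(M) &= \frac{1}{|V|} \sum_{v \in V} \left| \sigma_0(v; M) - \sigma_0(v; 300{,}000) \right|, \\
\mathcal{E}_1(M) &= \frac{1}{|V|} \sum_{v \in V} \left| \sigma_1(v; M) - \sigma_1(v; 300{,}000) \right|.
\end{align*}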

We first consider the case of the IC model. Then, the value of $\mathcal{E}$ was
0.03 and 0.04 for the blog and Wikipedia datasets, respectively. Thus, we
can assume that the values of $\sigma_0(v; 300{,}000)$ and $\sigma_1(v; 300{,}000)$ are almost
the same for any $v \in V$.
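
Each of the three error measures defined above is a mean absolute difference between two sets of per-node estimates. A minimal Python sketch, assuming the estimates are stored in dictionaries keyed by node (the function name and data layout are hypothetical):

```python
def mean_abs_difference(est_a, est_b):
    """Mean over all nodes v of |est_a[v] - est_b[v]|.

    est_a, est_b: dicts mapping every node v in V to an estimate of the
                  influence degree of v (both must cover the same nodes).
    """
    if est_a.keys() != est_b.keys():
        raise ValueError("both estimates must cover the same node set")
    return sum(abs(est_a[v] - est_b[v]) for v in est_a) / len(est_a)
```

Passing the estimates of the two methods at the reference value of M gives the first measure; passing the estimates of one method at M and at the reference value gives the corresponding convergence measure for that method.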


Table 7: Convergence speed for the blog dataset.

M          $\mathcal{E}_0(M)$   $\mathcal{E}_1(M)$
100        1.16                 1.12
1,000      0.36                 0.36
10,000     0.11                 0.12
100,000    0.03                 0.03

Table 8: Convergence speed for the Wikipedia dataset.

M          $\mathcal{E}_0(M)$   $\mathcal{E}_1(M)$
100        1.28                 1.23
1,000      0.42                 0.42
10,000     0.13                 0.14
100,000    0.03                 0.03

Tables 7 and 8 show the values of $\mathcal{E}_0(M)$ and $\mathcal{E}_1(M)$ for the blog and
Wikipedia datasets, respectively. These results imply that the proposed
and the conventional methods estimate $\{\sigma(v);\ v \in V\}$ with almost the same
accuracy for the IC model. We also obtained similar results for the case of
the LT model. For example, the value of $\mathcal{E}$ was 0.03 and 0.09 for the blog
and Wikipedia datasets, respectively. For the blog dataset, the values of
$\mathcal{E}_0(10{,}000)$ and $\mathcal{E}_1(10{,}000)$ were 0.13 and 0.12, respectively. Also, for the
Wikipedia dataset, the values of $\mathcal{E}_0(10{,}000)$ and $\mathcal{E}_1(10{,}000)$ were 0.36 and
0.37, respectively. These results support our conjecture.
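
As a rough sanity check on the IC-model values for the conventional method in Table 7, one can compare the error reduction per tenfold increase of M with the factor of about 3.16 (the square root of 10) that the usual Monte Carlo error scaling would suggest. That scaling is an assumption brought in here for illustration, not a claim established in this appendix, and the tabulated values are rounded to two decimals, so the last ratio is coarse.

```python
import math

# Conventional-method column of Table 7 (blog dataset, IC model).
TABLE7_CONVENTIONAL = {100: 1.16, 1000: 0.36, 10000: 0.11, 100000: 0.03}


def decade_ratios(errors):
    """Ratio of consecutive errors when M grows by one tabulated step."""
    ms = sorted(errors)
    return [errors[a] / errors[b] for a, b in zip(ms, ms[1:])]


if __name__ == "__main__":
    for r in decade_ratios(TABLE7_CONVENTIONAL):
        print(f"ratio = {r:.2f}, sqrt(10) = {math.sqrt(10):.2f}")
```

All three ratios lie between 3 and 4, which is in line with that scaling within the rounding of the table.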

B  Fluctuation in Simulations of Information Diffusion Models

For each $v \in V$, we examine the fluctuation in the number $\varphi(v)$ of final active
nodes for an initially activated target node $v$ through 1,000 simulations in the
IC and LT models. Let $\mu(v)$ and $s(v)$ denote the empirical mean and the
standard deviation of $\varphi(v)$ over the 1,000 simulations, respectively. We define
$\bar{\mu}$ and $\bar{s}$ as the empirical means of $\{\mu(v);\ v \in V\}$ and $\{s(v);\ v \in V\}$,
respectively. For the blog dataset, $\bar{\mu}$ and $\bar{s}$ were as follows:

IC model ($p = 10\%$): $\bar{\mu} = 8.6$, $\bar{s} = 14.3$.

LT model: $\bar{\mu} = 6.8$, $\bar{s} = 14.9$.

For the Wikipedia dataset, $\bar{\mu}$ and $\bar{s}$ were as follows:

IC model ($p = 1\%$): $\bar{\mu} = 8.1$, $\bar{s} = 16.1$.

LT model: $\bar{\mu} = 12.6$, $\bar{s} = 42.4$.

Here, the values are rounded to the first decimal place. We can observe that
$\bar{s}$ is very large compared with $\bar{\mu}$. Therefore, we see that the number of final
active nodes for a given target set can vary greatly from simulation to simulation in
the IC and LT models.
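
The per-node statistics used in this appendix can be reproduced in outline with the following Python sketch for the IC model with a uniform diffusion probability p. The adjacency-dictionary representation, the function names, and the use of the population standard deviation are assumptions; the text does not state which form of the standard deviation was used.

```python
import random
import statistics
from collections import deque


def simulate_ic(adj, seed_node, p, rng):
    """One run of the IC model; returns the number of final active nodes.

    adj: dict mapping each node to a list of its out-neighbors.
    Each newly active node gets a single chance to activate each of its
    currently inactive out-neighbors, succeeding with probability p.
    """
    active = {seed_node}
    frontier = deque([seed_node])
    while frontier:
        u = frontier.popleft()
        for w in adj.get(u, ()):
            if w not in active and rng.random() < p:
                active.add(w)
                frontier.append(w)
    return len(active)


def mean_and_std(adj, seed_node, p, runs=1000, seed=0):
    """Empirical mean and standard deviation of the final active count."""
    rng = random.Random(seed)
    counts = [simulate_ic(adj, seed_node, p, rng) for _ in range(runs)]
    return statistics.mean(counts), statistics.pstdev(counts)
```

Averaging the two returned values over all nodes gives quantities analogous to the network-wide mean and standard deviation reported above.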